A Digression on Divergence

Most people who do serious GPU work understand that shaders and other GPU programs are executed in a SIMD or “SIMT” fashion using “warps” or “wavefronts” that group together multiple program instances (vertices, fragments, work-items, “threads”). It is also widely understood that when the condition computed for a dynamic if/else is different across these instances, the warp must execute “both sides.”

Aside: If you weren’t aware of any of that, you might like to look over this presentation from Kayvon Fatahalian before you proceed.

Terminology

Our terminology for GPU programming is kind of a mess, since every hardware vendor and every API/language seems to invent its own nomenclature. In this post, I’ll refer to individual running instances of a program as invocations (you might know these as vertices, fragments, work-items, or “threads”), and the implementation-specific aggregates of these instances that are actually scheduled as warps (you may know these as wavefronts or threads).

Caveat: For those familiar with OpenCL or Compute Shader, a warp is not the same thing as a “work-group” or “thread-group.”

When the different invocations in a warp disagree on the way to go at a conditional branch, we will say they diverge. This is actually a terrible choice of words, since divergence is already used in CS theory to refer to non-termination, but there aren’t many good alternatives.

The whole topic is closely related to the question of when particular expressions/variables are uniform or varying across a warp.

So Why Does This Matter?

Many people don’t really understand what goes on for such “divergent” branches, beyond the “execute both sides” mantra. The various hardware vendors tend not to talk about it, and the API/language specs are surprisingly mum.

Maybe it doesn’t matter. After all, who cares how it is implemented if it Just Works™, right?

Well, the problem is that ever since we put things like global-memory atomics into our GPU languages, the behavior of divergent branches has been semantically observable. This is perhaps most pronounced if you try to get clever and write a per-pixel mutex, as my colleague Marco Salvi did. The following example is in HLSL, but you can write an equivalent program in GLSL, OpenCL, or CUDA:

// We will use each 'uint' entry in this buffer as a lock
RWStructuredBuffer<uint> lockBuffer;

void FeelingClever()
{
    uint pixelID = ...; // compute which lock to use

    // try to acquire per-pixel lock,
    // using compare-and-swap

    [allow_uav_condition] // need this to appease fxc...
    while (true)
    {
        uint oldValue;
        InterlockedCompareExchange(lockBuffer[pixelID],
                                   0, 1, oldValue);

        // if we successfully swapped 0->1,
        // then we got the lock
        if( oldValue == 0 )
            break;
    }

    DoSomethingWhileHoldingLock();

    // release the lock and give everybody else a shot
    lockBuffer[pixelID] = 0;
}

In a world where every Pixel Shader invocation ran as a truly independent thread, this ought to work. If you try to run code like this on current D3D11 GPUs, though, you will find that it seems to work on some implementations, and breaks on others.

The crux of the problem is that a subset of the shader invocations in warp A might grab their mutex right away, while others fail to grab theirs, forcing the entire warp to spin in the while(true) loop. If the locks that A is waiting for are held by a different warp B, which is also in the same situation, then we have a deadlock.

Now, you might decide that this code just isn’t clever enough (as I initially did, when Marco brought this problem to me). If the warps are stuck spinning in that while(true) loop, then the solution might be as simple as rearranging the code so they never have to leave the loop in the first place:

RWStructuredBuffer<uint> lockBuffer;

void FeelingExtraClever()
{
    uint pixelID = ...;
    [allow_uav_condition]
    while (true)
    {
        uint oldValue;
        InterlockedCompareExchange(lockBuffer[pixelID],
                                   0, 1, oldValue);

        // if we successfully swapped 0->1,
        // then we got the lock
        if( oldValue == 0 )
        {
            // rather than break out of the loop,
            // just do the work here
            DoSomethingWhileHoldingLock();

            // release the lock and give
            // everybody else a shot
            lockBuffer[pixelID] = 0;

            // only bail from the loop after
            // we've released the lock
            break;
        }
    }

}

It turns out that this version didn’t work either. I honestly can’t remember which versions failed on which hardware, but nothing we tried worked across the board.

Stepping Back

At this point it is worth stopping to ask: why might we expect one of these two options to work, and not the other? What kind of assumptions are we making about how the HW handles divergent branches? Are those assumptions valid in practice?

It turns out that we naturally make a lot of very strong intuitive assumptions about how GPUs will execute our programs, and most of these assumptions aren’t actually guaranteed by the programming models or the hardware.

Aside: This strikes me as similar to the situation with memory models, where people have a strong tendency to assume sequential consistency, just because it is so natural. Un-learning “common sense” is one of the hardest parts of parallel programming.

What Do We Know?

Let’s start with the stuff that seems to be common across most GPU architectures.

We have our warp, made up of several invocations. Along with that we have an active mask which tells us which of the invocations are currently live. The warp as a whole has a single program counter, from which it reads and executes instructions for all of the invocations in the active mask.

Aside: Some people will insist that GPUs have a distinct program counter for every invocation in the warp. These people are either (A) confused, (B) trying to confuse you, or (C) working with some really odd hardware.

When the warp executes a divergent branch, the set of active invocations divides into two (or more) subsets. The warp must pick one of these subsets for the new active mask, and defer the other(s). It is easiest to think of the deferred invocations as sitting on a “divergence stack,” although particular implementations may vary a lot between vendors.
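
To make that model a little more concrete, here is a toy sketch of a warp with a single program counter, an active mask, and a divergence stack. It is written in C# purely for illustration; the names and the stack-based bookkeeping are my own simplification, not any vendor’s actual mechanism.

using System.Collections.Generic;

struct DeferredPath
{
    public ulong Mask;    // invocations waiting to run this path
    public int   Resume;  // program counter where they will resume
}

class ToyWarp
{
    public ulong ActiveMask;      // which invocations are currently live
    public int   ProgramCounter;  // one PC shared by the whole warp

    // deferred subsets waiting to be picked up later
    public Stack<DeferredPath> DivergenceStack = new Stack<DeferredPath>();

    // Called when the warp hits a two-way branch; condMask has a bit set
    // for each active invocation whose branch condition was true.
    public void Branch(ulong condMask, int takenTarget, int fallThrough)
    {
        ulong taken    = ActiveMask & condMask;
        ulong notTaken = ActiveMask & ~condMask;

        if (notTaken == 0) { ProgramCounter = takenTarget; return; } // uniform
        if (taken == 0)    { ProgramCounter = fallThrough; return; } // uniform

        // Divergence: follow one subset now, defer the other for later.
        DivergenceStack.Push(new DeferredPath { Mask = notTaken, Resume = fallThrough });
        ActiveMask     = taken;
        ProgramCounter = takenTarget;
    }
}

Note that this sketch arbitrarily follows the “taken” subset first; which subset a real implementation follows, and when the deferred one gets resumed, are exactly the questions below.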

What Don’t We Know?

At this point I hope some questions are nagging at you, in particular:

  • When we split the active set at a divergent branch, which invocation(s) get “followed” first, and which deferred?
  • When do the invocations that get deferred get picked up and followed?
  • When do the invocations that have diverged “re-converge”?

Because current GPU programming languages do not specify these behaviors, there is a lot of latitude in how different hardware can answer these questions. I can’t easily comment on what particular vendors do (except where information is public), but I can try to illuminate the breadth of the design space.

A particular hardware platform might support structured control-flow instructions (akin to the control flow operations in the D3D shader bytecode), or unstructured branches (c.f. CUDA’s PTX), or both. Things that might seem obvious in the high-level language code can sometimes be subtle when you are looking at a low-level ISA.

Which Path Gets Followed First?

In general, it shouldn’t matter which path gets followed first at a divergent branch, although my per-pixel mutex example clearly depends on the invocation(s) that obtained their lock being followed before those that failed to obtain theirs. There are, however, some cases where an implementer might clearly want to follow one option over another.

If we have a one-sided if:

if (someVaryingCondition)
{
    // inside
}
// after


it seems obvious that we should follow the invocations that go inside the if before those that are just going to skip over it. Of course, remember that “obvious” can be a dangerous word.

Similarly, for the back-edge on a do-while loop:

do
{
    // inside
}
while (someVaryingCondition);


it seems obvious to follow the invocations that continue looping first.

Let’s look at some possible strategies:

Follow the “branch taken” option first

For a simple conditional branch in the hardware ISA, we could prefer to follow the invocations that take the branch over those that continue normally. For the loop example above, this does the obvious thing, but not for the if example.

Follow the “not taken” option first

This option would do the expected thing for the if, but go against intuition for the loop. It might also be better for instruction caching, since the instructions immediately after the branch are likely to be in the cache already.

Follow “taken” first for back edges, “not taken” for forward edges

We can switch between the two strategies above based on whether the destination of the branch is before or after the branch instruction itself. This gives the expected “obvious” behavior for both of the examples above (assuming the compiler lays out basic blocks in the same order as the high-level code).
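
As a toy sketch, the decision might boil down to a simple address comparison; treating a branch whose target precedes it as a back edge is an assumption about the ISA, not a statement about any particular GPU:

static class BranchPolicy
{
    // Follow the "taken" side first exactly when the branch is a back edge.
    public static bool FollowTakenFirst(int branchAddress, int targetAddress)
    {
        // A back edge (e.g. a loop's closing branch) jumps to an earlier
        // address; a forward branch (e.g. skipping an if body) jumps ahead.
        return targetAddress < branchAddress;
    }
}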

Follow whichever path has fewer active invocations

This may sound like an odd design decision, but it is precisely what is described in AMD’s Southern Islands (GCN) ISA documentation.

If we always follow the path that has fewer active invocations, then we know that every divergent branch reduces the number of active invocations by at least half. As a result, the maximum depth of our divergence stack is logarithmic, rather than linear, in the size of the warp (6 rather than 64 entries, in the case of GCN).
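
Here is a sketch of how that choice could be made, assuming a 64-bit active mask; this is only meant to illustrate the policy and the resulting stack-depth bound, not GCN’s actual hardware logic:

using System.Numerics;

static class DivergencePolicy
{
    // Follow whichever side of the branch has fewer active invocations,
    // and defer the larger side. The side we keep has at most half of the
    // active invocations, so nested divergence can only stack up
    // log2(warp width) deep: 6 entries for a 64-wide wavefront.
    public static bool FollowTakenSideFirst(ulong activeMask, ulong condMask)
    {
        ulong taken    = activeMask & condMask;
        ulong notTaken = activeMask & ~condMask;
        return BitOperations.PopCount(taken) <= BitOperations.PopCount(notTaken);
    }
}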

Multi-way (switch) and indirect branches (function pointers) add more complexity, but we already have several viable design alternatives given only two-way branching.

Where/when do we re-converge?

After a divergent branch, a warp is executing with reduced SIMD efficiency: every instruction only applies to a subset of the invocations. In order to reclaim that efficiency, we would like to re-converge and continue executing with the original active mask.

This isn’t just a performance concern, though. There are various operations that communicate or synchronize between the invocations in a warp: finite-difference derivatives in fragment shaders, synchronization barriers in compute languages, and horizontal SIMD reduction operations (e.g., CUDA’s warp-granularity “voting” operations). These operations typically require control-flow reconvergence for their semantics.

In practice, the “re-convergence point” for any potentially-divergent branch must be selected at compile time by the driver’s low-level compiler. This ensures that when the micro-architecture encounters a divergent branch at runtime, it already knows where it can expect to re-converge, and can arrange any internal data structures accordingly.

Re-converge at the end of an if or loop

We as programmers tend to assume that after a divergent branch, the invocations re-converge as soon as possible: typically at the end of the associated high-level control-flow construct. Note, though, that current GPU programming systems do not give this as a guarantee, no matter how intuitive it might seem.

The OpenGL Shading Language spec defines a notion of uniform and non-uniform control flow, and specifies that uniform control flow resumes at the end of a high-level control-flow construct. This spec language is meant to capture the essence of divergence and re-convergence, but it should be noted that it is only used to define when certain operations (e.g., finite-difference derivatives in a fragment shader) are valid. No stronger guarantees are made.

When using a language with unstructured control flow (goto in OpenCL or CUDA; PTX), high-level control-flow structures may not be discernible. In these cases, re-convergence rules must be defined on an arbitrary control-flow graph (CFG).

Re-converge at the immediate post-dominator

A simple approach that can be applied to an arbitrary CFG (and is thus applicable to assembly-level interfaces, or in the presence of goto) is to identify the immediate post-dominator of the basic block that has the branch.

Aside: For those unfamiliar with the term, it is simplest to think of the immediate post-dominator as the earliest point in the code that control flow must eventually pass through. For a precise definition, please consult any compiler textbook, or Wikipedia.

On paper this sounds like a good approach, but it fails to match our intuitive expectations for several cases of structured control flow:

while( X )
{
    if( A )
    {
        if( Y )
            break;
    }
    B;
}

In this case, we expect any invocations that diverged at the branch on A, but did not subsequently break out of the loop, to re-converge before executing B. However, this isn’t guaranteed by immediate-post-dominator reconvergence, since the immediate post-dominator of the A branch is at the end of the while loop, and not at the end of the if.

The same basic problem arises for switch statements with fall-through:

switch( A )
{
case 0:
    X;
    // fall through
case 1:
    B;
    break;
default:
    break;
}

Here again, an immediate post-dominator rule tells us to re-converge after the whole switch, and does not tell us to re-converge the case 0 invocations with the case 1 invocations before executing B.

As a reminder, re-convergence isn’t just about performance. If B in these examples included a Compute Shader barrier operation, then failure to re-converge at the right place could cause a deadlock.

Re-converge at “thread frontiers”

This paper describes a refined approach to re-convergence that addresses some of the shortcomings of immediate post-dominator reconvergence. They define “thread frontiers” that capture the intuitive notion of re-converging as soon as possible after a branch. Achieving thread-frontier reconvergence can require adding new conservative control flow edges (forcing the warp to visit basic blocks it would otherwise skip). The paper also describes new hardware requirements to implement this scheme efficiently.

Maximal convergence

The ISPC system, which implements a GPU-style programming model on SIMD CPU cores, defines a notion of “maximal convergence.” Intuitively, this can be thought of as re-convergence at each and every sequence point as defined by the C language. If operation X happens before Y from the point of view of one invocation, then X for all invocations appears to happen before Y for any invocation in the warp.

This very strict model of convergence is often what people think of when they talk about “warp-synchronous programming,” or about invocations in a warp executing in “lock-step.” This kind of semantic guarantee can be beneficial to users (it would allow us to write a working per-pixel mutex), but it comes at the cost of flexibility for compiler optimizations. For example, if an implementation wants to emulate 16-wide warps on 8-wide SIMD hardware by doubling up math instructions, it is very restricted in how it can schedule those independent 8-wide operations.

Other than ISPC, I am not aware of any GPU-like programming models that give this strong guarantee. CUDA documentation used to give general advice for programmers attempting to use warp-synchronous approaches, but more recent documentation seems to advise against the practice.

When Are Deferred Paths Followed?

When control flow for a warp arrives at a re-convergence point, it needs to run any invocations for the “other” path(s) that had been deferred before it continues execution past the re-convergence point. The simplest way this might be achieved is by popping those deferred invocations off of the “divergence stack” (or alternative hardware mechanism).

This is one case where it seems like there really is an obvious Right Thing to do, and I’m not aware of a lot of design alternatives. Of course, if you don’t know which path is followed first, or where re-convergence points will be placed, there still isn’t much you can do.

Conclusion

If you find yourself trying to write “clever” GPU code that depends on order of execution or memory operations between parallel invocations within a single warp, please be aware that you are stepping into a minefield of implementation-specific behavior. For those of you targeting fixed hardware platforms, you might be able to find out what your GPU is doing under the hood and exploit this knowledge.

For those of you working on more open platforms, I would invite you to carefully read through the available API and language documentation. As a general rule, if the API/language spec doesn’t explicitly guarantee the behavior you are hoping for, you can bet that some hardware out there violates your assumptions.

I hope I’ve given you a sense of just how deep this semantic rabbit-hole goes.

Improving on Constructors

Constructors, as they appear in mainstream object-oriented languages, have numerous issues. Directly allocating objects with constructors creates coupling, and since most languages cannot abstract over constructors, we must resort to techniques like Factory patterns or Dependency Injection to provide the abstraction.

These issues seem to be well understood (or at least well documented), so I thought I’d bring up a less dangerous but no less annoying issue: when I try to code in a mostly-functional style, the approach to construction in C++, Java and C# forces me to write way too much boilerplate.

As an example, imagine I am defining some C# classes to represent a simple lambda-calculus AST:


class Exp { ... }
class Abs : Exp { ... }
class App : Exp { ... }
class Var : Exp { ... }

Immutable Objects

I’d like to work with objects that are immutable once constructed. That means that I will not expose their fields, and will expose only “getter” properties. If I’d like each Exp to have some information on its range in the source code (using a value type SourceRange) I might write a canonical Exp as:


public class Exp
{
    public Exp( SourceRange range )
    {
        _range = range;
    }

    public SourceRange Range { get { return _range; } }

    private SourceRange _range;
}

For an immutable class with a single attribute, I’ve had to write a surprising amount of boilerplate. I’ve had to write the type of the attribute (SourceRange) three times, and variations on the name of the attribute (range, Range, _range) six times.

If I were using Scala, though, I could express the original intent quite compactly:


class Exp( val Range : SourceRange )
{}

This notation defines both a parameter to the default constructor of Exp and a read-only property Range that gives access to the value passed into the constructor.

Derived Classes

So it appears that Scala can eliminate our boilerplate in Exp, but what happens in our derived classes? Starting with a canonical C# encoding again, here is Abs:


public class Abs : Exp
{
    public Abs( SourceRange range,
                string name,
                Exp body )
        : base( range )
    {
        _name = name;
        _body = body;
    }

    public string Name { get { return _name; } }
    public Exp Body { get { return _body; } }

    private string _name;
    private Exp _body;
}

The boilerplate for the new properties is the same as before. What is new, though, is that we are forced to re-state the attributes of the base class in our new constructor. While this seems like a relatively small annoyance at first, we end up having to repeat this boilerplate in each subclass we add. If the base class has a non-trivial number of attributes, this obviously gets proportionally worse.

In this case Scala doesn’t provide a solution to avoid this kind of boilerplate:


class Abs( range : SourceRange,
           val name : String,
           val body : Exp )
    extends Exp(range)
{}

Extending the Base Class

So what’s so bad about this per-subclass boilerplate? The dogmatic answer is that it is a violation of Once and Only Once. A more pragmatic answer arises if we need to alter or extend the base class.

Suppose we decide to add a Type attribute to Exp. This attribute might have a default value (e.g. null), so existing call sites that create expressions do not need to be updated. How much code do we have to edit to achieve this?

Adding a new field and property to Exp is relatively easy, as is adding a new Exp constructor with an additional parameter. In addition, though, we’d have to update every subclass of Exp to include another constructor with the new parameter.
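
To make the ripple effect concrete, here is a hedged sketch of what that edit might look like; the ExpType class is a stand-in I have invented, and the exact constructor shapes are assumptions:

public class Exp
{
    // the existing constructor, kept so old call sites still compile
    public Exp( SourceRange range )
        : this( range, null )
    {}

    // the new constructor that accepts the new attribute
    public Exp( SourceRange range, ExpType type )
    {
        _range = range;
        _type = type;
    }

    public SourceRange Range { get { return _range; } }
    public ExpType Type { get { return _type; } }

    private SourceRange _range;
    private ExpType _type;
}

public class Abs : Exp
{
    // every subclass needs an edit like this one, even though the
    // subclass itself does not care about the new attribute
    public Abs( SourceRange range, string name, Exp body, ExpType type )
        : base( range, type )
    {
        _name = name;
        _body = body;
    }

    // the old constructor also has to stick around (or forward) so that
    // existing call sites keep compiling
    public Abs( SourceRange range, string name, Exp body )
        : this( range, name, body, null )
    {}

    public string Name { get { return _name; } }
    public Exp Body { get { return _body; } }

    private string _name;
    private Exp _body;
}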

This is a serious compromise in modularity. If we are creating a class library used by other programmers or other organizations then we may not even have access to all subclasses. This means there are certain edits that we cannot make to the base class.

A Possible Compromise

If we sacrificed the goal of having immutable objects, we could use C# auto-generated properties to avoid the per-subclass boilerplate:


public class Exp
{
    public SourceRange Range { get; set; }
}

public class Abs : Exp
{
    public string Name { get; set; }
    public Exp Body { get; set; }
}

With this approach we would then use the property-based initialization syntax when constructing an instance:


var abs = new Abs{ Range = new SourceRange(...),
                   Name = "x",
                   Body = ... };

Adding a Type property to Exp could then be accomplished without affecting every subclass. Clients who create expressions could freely set the new property in their initializers.

There are two big downsides to this approach, though. The first is that we have sacrificed the immutability of our objects – every property has both a getter and a setter. The second is that clients can now create uninitialized or partially-initialized objects by forgetting to include any of the “required” attributes in their initializer.

You can decide for yourself whether that is an appropriate solution. I for one find it distasteful, and dislike that newer .NET technologies like WPF and XAML seem to be encouraging this style.

Doing Better

Ideally we’d have a solution that combines the declarative style and guaranteed initialization of the Scala approach with the easy extensibility of the C# automatic-property approach. It turns out that CLOS (the Common Lisp Object System) and its descendant Dylan already use a solution along these lines.

Casting our example into idiomatic Dylan, we would have:


define class <exp> (<object>)
    constant slot range :: <source-range>, required-init-keyword: range:;
end class;

define class <abs> (<exp>)
    constant slot name :: <string>, required-init-keyword: name:;
    constant slot body :: <exp>, required-init-keyword: body:;
end class;

A user could then create an expression using the standard make function (the Dylan equivalent of the new operator in other languages):


let abs = make(<abs>,
               range: someRange,
               name: "x",
               body: ... );

Because Dylan and CLOS are dynamic languages, failure to provide all required parameters yields a runtime rather than compile-time error. Except for this, however, the Dylan approach provides exactly the combination of benefits described above.

Conclusion

Object initialization is a thorny issue in many modern object-oriented languages. In order to gain the benefits of both safety and extensibility, we should be willing to look at a wide variety of languages for inspiration.

In Defense of OldSpeak

My last post tried to make a case in favor of static typing based on the fact that it allows us to do overload resolution. At the time I hadn’t read this post on Gilad Bracha’s Newspeak blog. In the thread on another post, he summarized the spirit of this essay when he commented that:

“Static type based overloading is a truly bad idea, with no redeeming value whatsoever”

I’m not going to claim that I know language design better than Bracha. I will, however, disagree with this extreme position on overloading. If you haven’t read Bracha’s essay, please do so before you proceed…

Let’s first touch on the examples that Bracha used to illustrate his case. Some of these examples relate to legacy issues in Java, and are thus not inherent to languages with overloading. I’ll happily dismiss them since I don’t have to deal with Java.

The rest involve overloads with the following two properties:

  1. The methods are all defined within a single class
  2. The methods are specialized on types that are related by inheritance

I claim that it is the combination of these two properties that is the crux of the argument. If the types involved are not related by inheritance, the “gotcha” aspect of figuring out which overload will be called goes away. And because the methods are all defined in one class (by one programmer?) we can trivialize the cost associated with renaming one of the overloads, or of planning to avoid the situation altogether.

For this limited case – “(1) and (2)” – I actually buy the argument. Static overloading in this case doesn’t do what you want. But what Bracha neglects to mention is that a pure object-oriented message send doesn’t achieve the desired result either! What you want in this case is dispatch on the run-time types of multiple arguments, aka multiple dispatch, aka multimethods.

There are legitimate concerns with multimethods (which Bracha notes) as expressed in e.g. CLOS and Dylan. There are, however, other approaches that are more suitable for a new language. That is a discussion for another day.

Having ceded the argument in the “(1) and (2)” case, in favor of multimethods, that leaves us with the remaining cases, which Bracha didn’t directly address.

The “(1) but not (2)” case is harmless – there is no chance of ambiguity in dispatch. Multimethods subsume this case for overloading anyway, so I don’t think it is particularly useful to discuss.

The remaining cases must all deal with methods that weren’t defined within a single class. We might also presume, then, that we should consider the possibility that the methods involved were defined by different programmers, working at different organizations.

Suppose programmer A defines their Widget class version 1.0. Programmer B decides to use it as the base class for their SuperWidget. SuperWidget has extended Widget by adding a new message “doSomethingSuper” with semantics that are tied into B’s product.

Unbeknownst to B, though, A has been upgrading Widget for version 1.1 by adding their own “doSomethingSuper” method, with completely different semantics (after all, B doesn’t know about A’s product). If B tries to upgrade to the new version of Widget, then what happens?

In a language like Python or Smalltalk, SuperWidget will accidentally override the base class definition of “doSomethingSuper”. Now clients that try to use a SuperWidget as a Widget 1.1 will fail because SuperWidget responds to a message with an unexpected behavior.

If you try this same scenario out in C# and the Microsoft CLR, you’ll find that previously-compiled versions of SuperWidget keep working with Widget 1.1, and clients that use it as a Widget will have no problems. If you recompile SuperWidget after the upgrade, you will be told that your “doSomethingSuper” method might introduce an ambiguity – you will be forced to decorate it explicitly as either an override of the base-class method, or a new method that just happens to have the same name.
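
Here is a minimal C# sketch of that situation, using the hypothetical Widget and SuperWidget from above (the comments describe the compiler’s behavior in broad strokes; exact diagnostics vary by version):

// vendor A's Widget, version 1.1
public class Widget
{
    public virtual void doSomethingSuper() { /* A's new semantics */ }
}

// vendor B's SuperWidget, originally written against Widget 1.0
public class SuperWidget : Widget
{
    // On recompiling against Widget 1.1, the compiler warns that this
    // member hides an inherited one, and B must state their intent:
    public new void doSomethingSuper() { /* B's unrelated semantics */ }

    // ...or, if B really does mean to take over A's new contract:
    // public override void doSomethingSuper() { /* B's override */ }
}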

The secret that makes this technique work is – you guessed it – static overload resolution. This is exactly the opposite of Bracha’s claim about static overloading in his essay:

“This means that existing code breaks when you recompile, or does the wrong thing if you don’t”

In this case, however, it is the overloading-free languages which inhibit the modular extensibility of the system, and static overloading that makes it possible for another language to avoid the problem.

Overloading is generally not something we pursue, even when our languages support it. Instead, we simply recognize that it is something that arises inevitably when we develop large software systems that aggregate and extend components developed by other programmers and other organizations. The space of useful identifiers is just too small to avoid all conflicts.

Given this fact, I choose to use tools that recognize the inevitability of name conflicts and give me mechanisms for resolving them.

A simple stance on the static/dynamic typing debate

Given that my only other post so far steps on the toes of a major company, I thought it might be appropriate to make my second post a bit less contentious. Well, that was until Arthur Whitney (creator of the K programming language, among others) came to talk about language design to our research group over at the Stanford Graphics Lab.

I found the talk and ensuing discussion very interesting, in large part because Whitney’s taste in language design is so far from my own. Of course, it’s hard to argue with results, and so I’ve decided to spend some time trying to understand my own biases about programming languages.

Static and Dynamic Typing

Most programming-language geeks seem to fall strictly on one side of the static/dynamic divide or the other (those who talk about “optional typing” as a middle road are almost always closet dynamic-typers). Despite the long-standing flame-war passionate debate between the opposing camps, there is one important position that I haven’t seen explained.

Looking at Method Dispatch

Here is a simple question that can be asked of most languages: given a method call:

obj.m( a )

what factors contribute to the decision of what code gets run?

Note that I’m using the obj.m( a ) syntax purely as a representative. There might be languages where the same thing is expressed as “obj m a“, “m(obj, a)” or any number of other variations. The syntactic details, and the object-oriented styling just make it easier to contrast some of the typical punching bags in the static/dynamic debate.

My own survey of languages leads to the following list of factors that might affect dispatch:

  1. The method name: m
  2. The dynamic run-time value of the object obj
  3. The statically computed type of obj
  4. The dynamic run-time value of the argument: a
  5. The statically computed type of a
  6. The lexical context of the entire expression

Comparing Some Languages

A typical dynamic class-based language like Smalltalk or Python will use only (1) and (2) above to control dispatch. For the purposes of dispatch the object (or its class) is treated like a dictionary mapping keys to values. In this case the key is the method name, typically encoded as a string “m“.
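
Spelled out as a rough sketch (in C#, with invented names), dispatch that uses only (1) and (2) looks roughly like a dictionary lookup on the receiver:

using System;
using System.Collections.Generic;

class DynamicReceiver
{
    // for dispatch purposes, the receiver is just a name-to-method table
    public Dictionary<string, Func<object, object>> Methods =
        new Dictionary<string, Func<object, object>>();

    public object Send(string methodName, object arg)
    {
        // only the method name and the receiver's run-time contents are
        // consulted; static types and the argument play no role in dispatch
        return Methods[methodName](arg);
    }
}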

Dynamic languages with multimethod dispatch also take (4) into account. This is built-in functionality in languages like the Common Lisp Object System (CLOS) and Dylan, and Guido himself demonstrated a way of adding them to Python as a decorator.

Many functional programming languages would use only (1) and (6) to resolve a call. In many of these languages functions like m are just ordinary values, and thus function names get looked up in the current lexical environment like any other name. Since a name typically only binds to a single value, there is no need to consult any other information to resolve a call. If the programmer wants to execute different paths depending on dynamic information (2, 4) then they will use pattern matching – an orthogonal mechanism to ordinary dispatch.

Haskell’s type classes allow dispatch in that language to take (3) and (5) into account. A function name in Haskell might belong to a type class, and thus the compiler must search for a suitable type-class instance that provides the concrete functions for particular types of operands. The search for type-class instances is driven by the lexical environment (6).

A typical static class-based language like C++, Java, C# or Scala will use (1), (2), (3), (5) and (6). Our mainstream static object-oriented languages don’t currently support multimethods, often for historical reasons (c.f. Martin Odersky’s comments (bottom of the page) on why Scala eschews multimethod dispatch). A static object-oriented language with multimethods could take all of the available information (1-6) into account. If you can’t already guess, that is exactly the kind of language I would advocate building.

Disambiguation

It is completely reasonable to ask why any of this matters. Widely-used languages have covered very different parts of this space, with the only constant being that dispatch depends in some way on the name of the operation being invoked (which it would seem pretty difficult to ignore).

Where your choices about how to handle dispatch really start to matter is when you have multiple entities with the same name. I intentionally used the word “entities” to be all-encompassing, but I’ll give a quick example.

Suppose you have a base class Base and a derived class Derived. What happens if both Base and Derived try to declare a method m, for different purposes? What if Base and Derived were developed by different people, or different organizations? What if Derived was originally coded using version 1.0 of Base, which didn’t have an m method, and then the programmer tries to upgrade to version 1.1, which adds the new method to Base?

As another example, what if you have two protocols or interfaces IFoo and IBar and some objects need to support both protocols? Are the two protocols allowed to have operations with the same name m, but different semantics?

In a functional-programming setting the question becomes: what if two separately-developed modules or typeclasses try to define operations with the same name – can user code import both?

Let’s look at how we can cope with these situations, depending on what kinds of dispatch our language supports:

Strategy: Statically-Typed Object-Oriented Languages

If our dispatch logic takes the static types of the operands into account, then there isn’t really a problem. In the expression obj.m(a) we can look at the static type of obj. If it is IFoo, then we can call IFoo::m; and if it is IBar, we can call IBar::m. If there is not enough information in the current lexical context to know which operation the user meant, then an error can be signaled because of the ambiguity.

This is important – the compiler can use all the information at its disposal in trying to resolve a potential ambiguity. In particular, this means it can use all the information that is manifest in the code the user is writing – including the types of the user’s variables. Only when there is no information available that unambiguously indicates the programmer’s intent will the compiler throw up its hands and request that the user be explicit. All of these languages give the programmer some way to explicitly signal their intent.
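
For concreteness, here is a small C# sketch of that disambiguation; the Gadget and Client classes are invented for illustration, while IFoo and IBar are the interfaces from above:

interface IFoo { void m(int a); }
interface IBar { void m(int a); }

class Gadget : IFoo, IBar
{
    // explicit interface implementations: one m per protocol
    void IFoo.m(int a) { /* IFoo's semantics */ }
    void IBar.m(int a) { /* IBar's semantics */ }
}

class Client
{
    static void Use(Gadget obj)
    {
        IFoo foo = obj;
        foo.m(1);          // the static type IFoo selects IFoo.m

        ((IBar)obj).m(2);  // an explicit cast resolves the call the other way

        // obj.m(3);       // does not compile: neither m is a member of Gadget itself
    }
}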

In contrast, if our language cannot take any of this contextual information (static types, lexical context) into account, then there is no way it can ever determine the programmer’s intent. All such references would be ambiguous, and we need to figure out some way to work around the problem.

Strategy: “That just doesn’t happen in practice!”

The easiest solution here is to just bury your head in the sand. This might involve making your language resolve all these name clashes by fiat – every class is just a key-to-value mapping, and there can only be one value with a particular key. Alternatively, you might try to protect the user from creating name clashes in the first place – don’t allow a class to inherit multiple methods with the same name, or to re-declare an inherited method.

In all honesty, these are reasonable approaches for any “little language.” Coming up with a robust solution to name clashes can compromise the simplicity of a language. The user needs to learn the specialized rules that control dispatch, and since every method call is subject to these rules this can make understanding code difficult until fluency is achieved. C++ developers who fully understand “argument-dependent lookup” are few and far between.

But if your “little language” has grown up to the point where it is being used to ship serious software (hello, Python!), then you will definitely face these problems. If the language does not directly address them, then you will end up using convention to avoid the issues.

Strategy: Name Decoration

A simple approach is to decorate the method name m to avoid clashes in the first place. That is, rather than IFoo and IBar both having a method m, we instead have methods IFoo_m and IBar_m. Of course, since the pain is felt by downstream developers, both IFoo and IBar would need to have preemptively decorated their method names to avoid future clashes.

Python actually has a limited form of support for this approach built in (mentioned briefly in the Python style guide). Attribute names that begin with two leading underscores are automatically mangled by prefixing them with the name of the enclosing class. Note that this is a very specific case of taking the lexical context into account (number 6 in my earlier list).

I hope it hasn’t escaped your attention that by amending the method name with the name of the enclosing type we have basically replicated the ability of a static language to dispatch based on the static type of the object. This came at the cost of more verbose method names everywhere – not just at the call sites that needed it.

Strategy: Module-Scoped Methods

The approach taken by CLOS and Dylan is to put the methods associated with a class into module scope, rather than nesting them inside the class. This means that even if an obj.m( a ) syntax is supported, it is just sugar for m( obj, a ), and the name m is always resolved by looking it up in the lexical environment.

The benefit of this approach is that the meaning of m can now be tailored to each call site – depending on which definition of m is in scope. The downside, of course, is that if more than one definition is in scope the reference once again becomes ambiguous.

Disambiguating these methods either requires explicitly qualifying the method with the name of a module: FooModule.m( obj, a ), or importing the method into the current scope with a decorated name: Foo_m( obj, a ). Thus we still end up having to explicitly decorate our references with information that the statically-typed language was able to glean from the current context.

Strategy: Avoid Inheritance

One each way to avoid ever making the obj.m( a ) operation ambiguous is to disallow certain forms of inheritance – either in the language, or institutionally. This might involve a blanket ban on implementation inheritance and multiple inheritance, or some more subtle middle ground.

If an object wants to support two different protocols like IFoo and IBar, it could do this by using the Adapter pattern. Two auxiliary objects would be created (one to implement IFoo, the other to implement IBar) and any place where we want to use our object through one of these interfaces, we pass the auxiliary object instead. Thus our original method call might be disambiguated as either obj.asIFoo.m( a ) or obj.asIBar.m( a ).

This form of disambiguation is not unlike inserting a cast to either IFoo or IBar in a statically-typed language (which is a common way of disambiguating a call). Compared to decorating method names it has the nice benefit that we can “cast” an object once, store the adapter object in a variable, and re-use the adapter for multiple calls (without decoration). Of course, this idiom then looks suspiciously like declaring a statically-typed variable.

Conclusion

When we are “programming in the large” and aggregating separately-developed code from multiple suppliers, name clashes seem inevitable. For most mainstream programming languages, it is possible to plan ahead and build libraries that are robust against these issues. Which language you are using, though, determines which strategies you can employ to achieve that robustness.

In many static languages, name-clash problems are anticipated by the language design, and there are explicit features (including method dispatch strategies) that make it relatively painless to cope with. In particular, users typically only have to amend their code in cases where their intent isn’t already clear from context. The price for this is a more complex set of rules in the language (for resolving potential ambiguities).

In all of the mainstream dynamic languages I am aware of, name clashes must instead be handled by diligent and defensive programming practices. In the best case, the idioms that work to avoid name clashes lead to slightly more verbose code. In the worst case, though, they amount to using ad-hoc mechanisms to express precisely the same information that the static types in other languages provide “for free.” The benefit is that for programs or libraries that don’t need this level of defensiveness, programmers benefit from intuitive language semantics.

For my part, I’d rather build (and use) languages that have a robust solution to problems like this built in. Dynamic typing advocates often complain about having to write extra annotations to “appease the compiler,” but that particular saw cuts both ways. If the context of my code makes my intention clear, I want my compiler to use that context to resolve potential ambiguities, rather than making me always add extra annotations to “appease the runtime.”