Ch5: Processing Changes to the Description

Now that we have a way for the compute to be described, and for kernel code to be generated from that description and dispatched for execution in the proper order, the next step is to give the user the ability to change the high-level description easily, in a manner that is understandable and abstract. An obvious option is to have the user edit the high-level description string directly. But it is easier and more intuitive for the user to change the compute description through a more abstract user interface. So we want to present the user with GUI options that both convey the current state of the compute’s description and allow the user to interact with and change that description. Therefore, we first need to be able to present a set of GUI controls that are generated from the current HLD and convey the current state of the compute to the user. These controls (e.g. selection boxes, buttons, checkboxes, number fields) not only tell the user what functions, parameters, input/output passes and conditions are currently used in the individual passes of the compute; their callback functions also trigger changes in the HLD string.

The first basic capability required to achieve this is the ability to dynamically add GUI controls to a user interface at runtime. We also need to be able to map these GUI controls to specific portions (e.g. passes) of the HLD when the user interacts with them. For that to be possible, each control needs to be able to internally tell which part of the HLD it corresponds to. All modern GUI languages and frameworks (e.g. C#, Objective-C, Swift, Xamarin) provide some mechanism to programmatically add controls to a user interface, but how a control’s member variables are used to self-identify its position in the HLD is usually up to the implementation. The callback functions of these controls (which are called when the user interacts with them) need to use this self-identification to parse the current HLD, locate the portion they correspond to, and modify it so that the user’s change to the control is reflected in the HLD.

Specifically, when the user is presented with GUI controls that reflect the current description of the compute, each control has callback functions that are invoked when the user interacts with it and changes its state. The first thing these callback functions should do is use the control’s self-identifying member variables to determine its corresponding position within the current HLD string. Based on that, the callback then parses the current HLD to reach the location in the string where the change is to be made. This step is necessary because a single HLD for a compute might have multiple passes, each with its own functions, conditions and parameters. Once the change is made, the result should be saved as the new current HLD for the compute. The user can then try the new compute by initiating kernel generation and execution on input data. If we had not used an intermediate HLDL, the GUI control callback functions would instead have to manipulate the kernel code directly – which would be extremely error-prone and difficult to manage.

The condensed code snippet below shows an example of how, in Blurate, the combobox control for selecting and indicating the primary function of each pass is dynamically presented in the user interface, as conveyed by the HLD description, and how the callback functions are written to handle changes to these controls. The code for the full GUI presentation of a multi-pass compute is much larger and more complicated, but the code for adding other individual controls follows this same template. This is the Windows C# version of the code for the studio version of Blurate; the Mac Objective-C version follows an equivalent pattern. In fact, the Mac version was produced by manually translating the C# code line-by-line to Objective-C. The main difference was inline callback function definitions, which are available in C# but which I never managed to use in Objective-C. In Blurate this is all implemented in what is called the “Kernel Designer”, which can be opened from the main menu of the studio version of the app (Windows or Mac).
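What follows is not the actual Blurate listing, but a minimal sketch of its shape using standard Windows Forms controls. The drawFormula() and constructPass() names, the use of the combobox’s Name property for self-identification, and the rewriting of the selected pass all follow the description in the next paragraphs; the member names (currentFormula, formulaPanel), the class name, and the function names listed in the combobox are assumptions for illustration. The real code also inserts Splitter controls between the per-pass panels, which is omitted here for brevity.

```csharp
using System;
using System.Windows.Forms;

// Minimal sketch (not the actual Blurate source): one panel per compute pass,
// each holding a combobox of primary functions whose callback rewrites the
// corresponding pass in the HLD formula string.
class KernelDesignerForm : Form
{
    string currentFormula;   // the current HLD of the compute (hypothetical field name)
    Panel formulaPanel;      // container the per-pass panels are added to (hypothetical)

    void drawFormula(string formula)
    {
        currentFormula = formula;
        formulaPanel.Controls.Clear();
        string[] passes = formula.Split(']');              // pass delimiter (see the HLDL chapter)
        for (int i = 0; i < passes.Length - 1; i++)
        {
            var passPanel = new Panel { Left = 0, Top = i * 90, Width = formulaPanel.Width, Height = 85 };
            formulaPanel.Controls.Add(passPanel);
            constructPass(passPanel, passes[i], i);
        }
    }

    void constructPass(Panel panel, string passFormula, int passIndex)
    {
        panel.Controls.Add(new Label { Text = "Pass " + passIndex, Left = 5, Top = 5, AutoSize = true });

        var funcBox = new ComboBox { Left = 5, Top = 30, Width = 150 };
        funcBox.Items.AddRange(new object[] { "conv", "add", "sub", "mult" }); // illustrative names
        funcBox.Name = "funcBox_" + passIndex;             // self-identification: pass index encoded in Name
        // Show the primary function currently specified for this pass in the formula.
        funcBox.SelectedItem = passFormula.Substring(passFormula.IndexOf('[') + 1,
            passFormula.IndexOf(':') - passFormula.IndexOf('[') - 1);
        funcBox.SelectedIndexChanged += funcBox_SelectedIndexChanged;
        panel.Controls.Add(funcBox);
    }

    void funcBox_SelectedIndexChanged(object sender, EventArgs e)
    {
        var box = (ComboBox)sender;
        int passIndex = int.Parse(box.Name.Split('_')[1]); // recover which pass this control maps to

        // Locate that pass in the current formula and swap in the newly selected function,
        // leaving the other passes untouched.
        string[] passes = currentFormula.Split(']');
        string pass = passes[passIndex];
        int funcStart = pass.IndexOf('[') + 1;
        passes[passIndex] = pass.Substring(0, funcStart) + box.SelectedItem
                            + pass.Substring(pass.IndexOf(':'));
        currentFormula = string.Join("]", passes);         // save as the new current HLD
    }
}
```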

The drawFormula() function first adds Splitter and Panel controls for each pass of the compute. It then calls the constructPass() function on each created Panel, which adds the controls for that pass. The constructPass() function adds a Label to the top of the Panel to indicate the pass number it corresponds to. Below that Label, it adds a combobox that lists all primary functions, with its selected option set to the function specified in the formula. The callback function of this combobox is where the HLD formula is modified when the user changes the selection.

The first task performed in the combobox callback function is to identify which compute pass it corresponds to. In this C# version of the Blurate code, this is done by parsing the Name member variable of the combobox object. From that, the starting and ending positions of the corresponding pass’s description can be determined in the current HLD formula string. The combobox’s current SelectedItem is then written in as the primary function of that pass in the formula, leaving the other passes unchanged.

Here the GUI controls are very simple, standard ones, and their callback functions map directly to individual passes of the HLD. But that doesn’t need to be the case for other applications. More creative controls can be presented, and they don’t need to each map to an individual compute kernel – they can represent more abstract multi-pass functionality and options. As an incremental example, a “Symmetric Convolution” or Sobel function option could be presented to the user in Blurate. Each such option would then translate to two kernels, rather than leaving it to the user to specify passes that map to individual compute kernels. This can be extended to more complex compute requiring even more kernels. The point is that the user does not necessarily need to be presented with every obscure option that maps directly to every dimension of variability in the HLDL. It’s up to the creative design of your user interface and the functionality you want to provide the user. The only thing that is a “must” is that the controls presented to the user need to be able to find where in the HLD of the compute they correspond to when the user interacts with them and triggers their callback functions.

So far we’ve looked at how to process multi-pass compute descriptions when each pass directly maps to an individually compiled and executed kernel, and how to provide the user with easy GUI options to modify the description of those passes. In the next section we’ll look at optimizing compute performance by fusing multiple passes from the HLD description into a single compute kernel, reducing the actual number of kernels executed.

Ch4: The Kernel Dispatch Feeder

Now that we’ve generated the kernel code for each compute pass, the next step is to loop through and dispatch them to be executed with the appropriate input data buffers. The first pass has no option but to execute on the original input provided. In the case of an application like Blurate, this input is an image, but in other applications it could be any other form of input data. The output of this first pass can be used as input to passes that immediately follow it in the High Level Description of the computation. An example of such back-to-back feeding of passes is the common morphological operation of dilation followed by erosion (a morphological closing), which is used to remove small artifacts from images. Alternatively, the passes following the initial pass can work on the same original input data and have their outputs merged in some later pass. An example of such computation is a 2D Sobel filter for edge detection: horizontal and vertical differentiation passes are first applied to the same input surface, and their outputs are then merged with a vector addition operation.

What we’ll be looking at in this section is how to loop through the passes of a given HLD of some compute, dispatch their kernel code to the graphics device for execution and feed each with the correct input data (wherever it may come from). We’ll refer to the piece of logic that performs this work as the Dispatch Feeder. Code speaks a thousand words, so let’s start by looking at what the abridged code for the Dispatch Feeder looks like in the Blurate source code.
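What follows is not the actual Blurate listing, but a minimal sketch of the shape such a Dispatch Feeder can take, using the names described below (CreateKernels(), ExecuteKernel(), _specialkernel, kernelInputBuffers, secondaryInputBuffers). The abstract members are stand-ins for the OpenCL host calls (clCreateProgramWithSource/clBuildProgram, clEnqueueNDRangeKernel, clFinish) and for the buffer allocation logic discussed later in this chapter.

```csharp
// Minimal sketch of a Dispatch Feeder (not the actual Blurate source).
abstract class DispatchFeeder
{
    protected object[] _specialkernel;       // compiled binaries, one per pass
    protected string[] kernelSources;        // generated kernel code, one per pass
    protected int[] kernelInputBuffers;      // primary input: producing pass index, or -1 for the original input
    protected int[] secondaryInputBuffers;   // secondary (merge) input: producing pass index, or -1 if none

    // Hypothetical wrappers around the compute API and buffer management.
    protected abstract object CompileKernelSource(string source);   // clCreateProgramWithSource + clBuildProgram
    protected abstract void EnqueueKernel(object kernel, params object[] buffers); // clEnqueueNDRangeKernel
    protected abstract void Finish();                                // clFinish
    protected abstract object AllocateOrReuseBuffer(int passIndex);  // see buffer liveness discussion below

    public void CreateKernels()
    {
        _specialkernel = new object[kernelSources.Length];
        for (int i = 0; i < kernelSources.Length; i++)
            _specialkernel[i] = CompileKernelSource(kernelSources[i]);
    }

    // Single-input version.
    public void ExecuteKernel(int index, object input, object output)
    {
        EnqueueKernel(_specialkernel[index], input, output);
        Finish();   // conservative sync between passes; can be relaxed for independent kernels
    }

    // Dual-input version, for passes with a merge function.
    public void ExecuteKernel(int index, object input, object secondary, object output)
    {
        EnqueueKernel(_specialkernel[index], input, secondary, output);
        Finish();
    }

    public object Run(object originalInput)
    {
        CreateKernels();
        var outputs = new object[kernelSources.Length];
        for (int i = 0; i < kernelSources.Length; i++)
        {
            object src = kernelInputBuffers[i] < 0 ? originalInput
                                                   : outputs[kernelInputBuffers[i]];
            outputs[i] = AllocateOrReuseBuffer(i);
            if (secondaryInputBuffers[i] < 0)
                ExecuteKernel(i, src, outputs[i]);
            else
                ExecuteKernel(i, src, outputs[secondaryInputBuffers[i]], outputs[i]);
        }
        return outputs[outputs.Length - 1];   // output of the final pass
    }
}
```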

It first loops through the generated code of each of the passes, compiles them, and stores the executable binaries in an object called _specialkernel. This is done in the CreateKernels() function. Later, calling the ExecuteKernel() method with an argument specifying the index of a binary invokes the API calls to execute that kernel (in the case of Blurate these are OpenCL clEnqueueNDRangeKernel invocations). Note that there are two versions of the ExecuteKernel() method: one for single-input kernels and one for dual-input kernels (i.e. ones that have a merge function). There is also a clFinish command enforced between kernel invocations, which can be optimized out between kernels that are not dependent on each other.

A caveat to consider in this process is that, for better performance, we may want to cache the compiled passes so they can be reused in later iterations of the compute. Certain compute operations may also internally contain multiple iterations of the same compute pass applied to different input data. Caching compiled binaries requires looking up already compiled kernel code based on a unique signature of each pass (such as a string hash of the kernel code, including any JIT parameters defined). Since compile time can be significant in certain environments, this is worth pursuing when performance is impacted by it. Caching also brings about a tradeoff between baking in JIT parameters, which requires separate builds for alternate parameter values, versus passing constant values in as kernel arguments, which introduces its own overheads. Another thing to keep in mind is that some graphics compute APIs have built-in mechanisms for identifying and reusing recently compiled code – removing that burden from the Dispatch Feeder.
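As a sketch of the caching idea (again, not Blurate’s actual implementation), a small cache can be keyed on the full kernel source string, which already includes any baked-in JIT parameters, so different parameter values naturally map to different entries:

```csharp
using System.Collections.Generic;

// Illustrative kernel-binary cache keyed on the kernel source string (the
// dictionary hashes the string internally). Because baked-in JIT parameters
// are part of the source, differing parameter values get separate entries.
class KernelCache
{
    readonly Dictionary<string, object> _compiled = new Dictionary<string, object>();

    public object GetOrCompile(string kernelSource, System.Func<string, object> compile)
    {
        object binary;
        if (!_compiled.TryGetValue(kernelSource, out binary))
        {
            binary = compile(kernelSource);      // e.g. the CompileKernelSource wrapper above
            _compiled[kernelSource] = binary;
        }
        return binary;
    }
}
```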

Another time-consuming task is the memory allocation needed to store the intermediate output data of the internal compute passes. This can be most noticeable on Android platforms, where memory allocation goes through the Java runtime. It is therefore important to be able to reuse allocated buffers once they fall out of their live scope (i.e. the point past which no following compute pass needs the data stored in them). For this reason, an important part of the kernel Dispatch Feeder involves identifying and reusing allocated intermediate buffers that are out of scope. In the code above, you can see that while sequencing through the kernels there is a forward scan to find any allocated buffers that aren’t going to be used again. These can be quickly identified from the kernelInputBuffers and secondaryInputBuffers arrays we filled in when parsing the HLD.

If no dead buffer is found, the current pass needs a newly allocated buffer for its output to be stored in. To keep only those output surfaces that are alive (i.e. will be used in later passes), we scan through the remainder of the passes and recycle the buffers that are no longer referenced. Besides the performance benefit of reusing dead buffers, for applications that deal with large surface data (as is the case with Blurate) this can also be necessary to minimize out-of-memory situations.
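A minimal sketch of that dead-buffer scan might look like the following, reusing the kernelInputBuffers and secondaryInputBuffers arrays filled in during HLD parsing; the function and parameter names are illustrative rather than Blurate’s:

```csharp
// Illustrative buffer-reuse logic: before allocating an output buffer for
// pass 'current', look for an already-allocated buffer whose contents are
// not read by any remaining pass (i.e. a dead buffer).
object AllocateOrReuse(int current, object[] outputs,
                       int[] kernelInputBuffers, int[] secondaryInputBuffers,
                       System.Func<object> allocate)
{
    for (int candidate = 0; candidate < current; candidate++)
    {
        if (outputs[candidate] == null) continue;   // never allocated or already recycled

        bool stillLive = false;
        // Forward scan: is this candidate's output read by the current or any later pass?
        for (int later = current; later < kernelInputBuffers.Length; later++)
        {
            if (kernelInputBuffers[later] == candidate ||
                secondaryInputBuffers[later] == candidate)
            {
                stillLive = true;
                break;
            }
        }
        if (!stillLive)
        {
            object reused = outputs[candidate];
            outputs[candidate] = null;              // ownership moves to the current pass
            return reused;
        }
    }
    return allocate();                              // no dead buffer found: allocate a new one
}
```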

An important note here is on the dimensionality of the input and output buffers. In Blurate, since all modifications happen on the input surface (either the full surface or a specific portion of it), all intermediate surface buffers retain the original input size, so there is no need to explicitly identify the size of each intermediate buffer. In applications where the dimensionality of the surfaces may change between compute passes, however, the HLDL would require the output dimensionality to be specified somewhere in the description of a compute pass, either as a fixed value or as a ratio of the input dimensions. The best example of such a compute pass in neural networks is a max-pooling function, where the output surface typically has half the width and height of the input surface (each output value is the max of 4 spatially neighboring values from the input surface). In these cases, the amount of buffer space needed for the new surface also needs to be tracked. There will be a tradeoff between the performance impact of releasing and reallocating a smaller surface, as opposed to keeping the larger surface allocated and reusing only the portion of it that is needed. If, alternatively, the new required output surface is larger than the surface being freed, then either a buffer resize or a free-and-reallocate will be needed. Also, if multiple buffers become free at the same time, there may be an opportunity to search for the largest one, or to combine them to form the new output buffer.

A further aspect to consider here is surface reuse within a single compute pass – i.e. passes that write to their input buffers. Such passes would potentially not require an output buffer to be allocated for them. Most compute APIs do allow buffers and surfaces to be defined as readable, writable or both. In practice, however, reading from and writing to the same surface within the same compute pass is either not allowed or not guaranteed to execute in any particular order, unless some form of synchronization barrier is employed. If barriers were used for certain compute functions, allowing them to write to their input buffers, this would need to be accounted for in the liveness detection code. Specifically, it would need to identify that such compute passes do not need a buffer allocated for their output, and the ownership of the overwritten buffer would need to transfer to the new pass so that it is not incorrectly identified as a dead buffer later on. In Blurate we assume there is no kernel code that employs barriers – which simplifies things for the Dispatch Feeder.

In general, the main problem with barriers is that, if not used with care, they can cause bizarre performance artifacts across different compute devices due to fragmentation of resources, depending on how a graphics architecture partitions its resources to exploit locality. It is therefore preferable to use “in-out” compute passes (ones that do not overwrite their input buffers) and avoid barriers as much as possible. However, some tasks benefit substantially from fine-grain serialization between writing to and reading from data buffers. Examples are matrix multiply and FFT operations, where storing highly reusable portions of data to be shared among work threads can notably improve memory efficiency and reduce stalls on memory access. Adding barrier and shared memory support for such functions into your dynamically generated compute kernels can be done on a per-function basis and can, for the most part, be transparent to the HLDL. The one place where you do need visibility into whether buffers are overwritten in a pass is here in the Dispatch Feeder. It needs to know which surfaces are read from and overwritten in the same compute pass, so that it can account for that when determining which buffers are alive and can’t be recycled in later passes.

Ch3: The High Level Description Language

In order to enable both portability across APIs and high-level transformations on the building blocks of a user-defined compute, we first want to generate a high-level description (HLD) of the functions it is made up of and how they feed each other. By parsing and processing this HLD we can separate the task of generating final kernel code for the target API from the task of performing high-level transformations.

The question we’ll be addressing in this chapter is how one goes about generating compute kernel code in a portable fashion that is robust enough to allow high-level transformations to be applied. There’s nothing stopping you from jumping right in and writing a bunch of code that stitches strings together to form kernel code that implements what you want the computation to do. After all, kernel code is nothing but a character string (one that the JIT compiler needs to be able to parse). In fact, simple string stitching is a perfectly good solution if you are only dynamically changing a few constants in a compute kernel for a given API. The problem with that approach is that it quickly becomes too complicated to manage, replicate and debug across different platforms, host programming languages and APIs. Adding new features and capabilities to your kernel-crafting code turns into a confusing task of scouring through large amounts of ad-hoc code trying to find where to modify or introduce new functionality, fields or conditions in the generated code.
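To make the string-stitching point concrete, here is a small, hypothetical example of baking a couple of constants into an OpenCL kernel string from the C# host side; the kernel and function names are invented for illustration:

```csharp
using System.Globalization;

// Simple string stitching: bake a blur radius and weight into an OpenCL C
// kernel string. Fine for a few constants, but it does not scale to many
// functions, conditions and APIs.
string BuildBoxBlurKernel(int radius, float weight)
{
    string w = weight.ToString(CultureInfo.InvariantCulture);
    return
        "__kernel void boxblur(__global const float* src, __global float* dst, int width)\n" +
        "{\n" +
        "    int x = get_global_id(0);\n" +
        "    int y = get_global_id(1);\n" +
        "    float sum = 0.0f;\n" +
        "    for (int dx = -" + radius + "; dx <= " + radius + "; dx++)\n" +
        "        sum += src[y * width + clamp(x + dx, 0, width - 1)];\n" +
        "    dst[y * width + x] = sum * " + w + "f;\n" +
        "}\n";
}
```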

Alternatively, you could write a large switch-case that simply looks up the code to compile from a large set of pre-coded kernels. Replicating the switch-case across platforms is simple enough, and although each kernel string would have to be individually ported across APIs, that is not an unreasonable one-time manual task. The main problem with this approach is that, in order to provide scalability and robustness, the number of kernel variations needed can rapidly grow (depending on your application) to unmanageable proportions. If there is any possibility of growth in the complexity and diversity of the computation, it’s better to lay the foundation for managing that in your kernel generation early on.

A trade-off to keep in mind with dynamically crafted kernels concerns the commonly used practice of baking in constant parameters. Generic kernel implementations often have numerous constant parameters that determine weights or conditions. Even when those parameters are fixed for every use of a kernel, they still need to be looked up during the actual computation if they are not specified as immediate values, simply because they are otherwise unknown at compile time. For this reason it is a common performance optimization to provide such constant parameters as build-time JIT parameters even when using static kernel code. An example of such parameters is the tensor dimensions used in the convolution layers of deep learning implementations. With dynamically crafted kernels this concept can be taken even further, and becomes more complicated, because the different potential kernel implementations for a piece of compute can have very different constant parameters.

Consider the weights in a simple 2D image convolution. They don’t change during a given instance of the compute, so instead of writing the kernel such that it has to repeatedly look these weights up from a constant buffer for each output pixel, they could (depending on the number of convolution weights) be baked into the kernel code as immediate values. This isn’t a problem for small-diameter convolutions, but as the diameter increases so does the size of the kernel code, so beyond a certain point you are better off reverting to looking the weights up. That threshold also depends on whether the convolution is separable. So not only will a separable convolution implementation look very different from a non-separable one, but so will the range over which it is worth providing the weights as immediates. With dynamic kernel crafting, then, the choice of JIT parameters (#defines) set at compile time can depend heavily on choices made while crafting the kernels.
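As an illustration of this trade-off (not Blurate’s actual generator), the sketch below emits a 1D convolution kernel with the weights baked in as immediates for small filters, and falls back to a constant-buffer lookup beyond an arbitrary placeholder threshold:

```csharp
using System.Globalization;
using System.Text;

// Illustrative generator: bake small weight sets into the kernel as immediates,
// fall back to a __constant buffer lookup for larger ones. The threshold of 16
// is an arbitrary placeholder, not a measured value.
static string BuildConv1DKernel(float[] weights)
{
    var sb = new StringBuilder();
    if (weights.Length <= 16)
    {
        sb.Append("__kernel void conv1d(__global const float* src, __global float* dst, int width)\n{\n");
        sb.Append("    int x = get_global_id(0);\n    float sum = 0.0f;\n");
        int half = weights.Length / 2;
        for (int i = 0; i < weights.Length; i++)
        {
            string w = weights[i].ToString(CultureInfo.InvariantCulture);
            sb.Append("    sum += " + w + "f * src[clamp(x + " + (i - half) + ", 0, width - 1)];\n");
        }
        sb.Append("    dst[x] = sum;\n}\n");
    }
    else
    {
        sb.Append("__kernel void conv1d(__global const float* src, __global float* dst,\n");
        sb.Append("                     __constant float* weights, int nweights, int width)\n{\n");
        sb.Append("    int x = get_global_id(0);\n    float sum = 0.0f;\n");
        sb.Append("    for (int i = 0; i < nweights; i++)\n");
        sb.Append("        sum += weights[i] * src[clamp(x + i - nweights / 2, 0, width - 1)];\n");
        sb.Append("    dst[x] = sum;\n}\n");
    }
    return sb.ToString();
}
```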

The approach you take to forming the kernel code depends on the specifics of your application. If you know there will be no need to expand the available functions in the future, or that there is inherently limited variability in the types of compute, or if you just want a simple solution for now and will change it later if necessary, then you might want to go with an HLD lookup-table approach and skip the remainder of this chapter. The next chapter gets into how to dispatch the compute passes described in your HLD. The rest of this chapter focuses on translating a well-defined HLD into kernel code, rather than using a lookup table or simple string stitching.

Parsing the HLD language gives structure to the code generation, so that when adding new, more complex features and functionality across implementations for different languages and APIs, it is easy to find and modify the corresponding location in the code generation logic. For this purpose, the parsing and translation will consist only of simple string manipulation and splitting functions, which are common across modern languages and therefore line-by-line portable, as opposed to complex regular expressions whose behavior can differ across languages. Your HLD language needs to convey all the forms of compute your application will need to perform, in a simple yet flexible form. It should convey the bare minimum amount of detail, so that you have the flexibility to change what the compute does in every important way, while remaining human-readable and easily parsable for debugging and for when changes to the HLD are made by the user or required by the application.

The first step towards defining such an HLD language (HLDL) is simply determining how you want to store the HLD – in binary or string format. Binary storage is more compact but less readable (which might actually be desirable in some settings). For the purpose of illustration we will assume string storage of the HLD in the rest of this discussion, because it makes debugging easier by removing an extra level of translation. The next thing to determine is a set of delimiter characters or strings (or enum values for a binary format) to be used in the HLDL to mark the boundaries of the different portions of the compute, its function parameters, conditionals and their parameters. We will refer to each of these portions as a pass. Also needed are unique identifier names for the functions. We can define those in the flexibility graph from the previous chapter: determine a unique string identifier (or enum value for a binary format) for each of the base functions, and write it in quotation marks inside or near the corresponding function in the graph. These strings and enums will be used inside the HLDL to specify the function used in a given pass of the compute.

The next thing to determine is how each individual pass will be distinguished from the others. Compute passes need to be identifiable in the HLD because the output of any pass could potentially be used as input to later passes (through their merge functions). Since the most natural way of parsing such an HLD is to step through it sequentially, a compute pass’s identifier is usually implied by its position in the HLD sequence. This is the approach taken in Blurate’s HLD language. If you feel that explicit identifier names for the compute passes would be useful in your application, then choose a delimiter that allows the name to be easily extracted from a pass description. A more minor aspect to consider is whether it’s important to have per-pass comments in the HLD description. In Blurate, to simplify the parsing, the HLD language supports comments only for the description as a whole, not on a per-pass basis.

Now that we’ve identified the basic components of our HLDL, we can start coding a function that will transform a provided description into executable kernel code. We’re not going to get into the fundamentals of formal languages and state machines here. This HLD language is just something internal to your application; it doesn’t need to cover every possible permutation, just the scenarios your application requires and will be using. Before writing any code, I would suggest first writing down a couple of simple compute tasks in your HLD language that have enough complexity to exercise the different function names, parameters and their delimiters. You’ll use these to create your “hello world” kernel code generation. Figure 2 below shows the high-level description of a 2D Sobel filter in Blurate’s HLD language, and what each of the characters and fields represent.

The formula description for a 2D Sobel filter, in Blurate’s HLDL.

The shortened C# code below illustrates how this HLD is parsed to generate kernel code (in this case OpenCL). It takes an input string, fullFormula, that is the HLD of the compute to be performed. The first task is to split this HLD into its individual compute passes. This is done using a simple string split function on the pass delimiter.
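As a rough sketch of that structure (not the actual Blurate parser, and with an invented example formula rather than Blurate’s real syntax), the skeleton below follows the steps described in the next few paragraphs: split on ‘]’, read the base function name up to the first ‘:’, then check for a ‘{’ input specifier:

```csharp
// Illustrative HLD parsing skeleton (not the actual Blurate parser).
// Hypothetical example formula, three passes where the last merges the first two:
//   "[conv:1,0,-1,2,0,-2,1,0,-1][conv:1,2,1,0,0,0,-1,-2,-1][add:{0,1}]"
enum FunctionType { None, Conv, Add, Sub, Mult }

void ParseFormula(string fullFormula)
{
    string[] passes = fullFormula.Split(']');               // one element per compute pass
    for (int pass = 0; pass < passes.Length - 1; pass++)
    {
        string crntFrmula = passes[pass].TrimStart('[');
        string KernelString = "";                            // accumulates the pass's kernel code

        // Base function name: everything up to the first ':'.
        string funcName = crntFrmula.Substring(0, crntFrmula.IndexOf(':'));
        FunctionType functionType;
        switch (funcName)
        {
            case "conv": functionType = FunctionType.Conv; break;
            case "add":  functionType = FunctionType.Add;  break;
            case "sub":  functionType = FunctionType.Sub;  break;
            case "mult": functionType = FunctionType.Mult; break;
            default:     functionType = FunctionType.None; break;
        }
        crntFrmula = crntFrmula.Substring(crntFrmula.IndexOf(':') + 1);

        // Input specification: a leading '{' means one or more explicit inputs,
        // otherwise the pass reads the original input surface.
        if (crntFrmula.StartsWith("{"))
        {
            string inputs = crntFrmula.Substring(1, crntFrmula.IndexOf('}') - 1);
            // record these into kernelInputBuffers / secondaryInputBuffers for the dispatcher
            KernelString += "// inputs: passes " + inputs + "\n";   // debug comment at top of kernel
        }
        else
        {
            KernelString += "// input: original surface\n";
        }

        // ... emit the kernel body for functionType, its parameters and conditionals ...
    }
}
```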

Splitting the input HLD into a string array on the closing bracket character ‘]’ lets us quickly identify how many compute passes there are in the HLD. Then, by looping over each element of the compute pass array, we can determine the base function, parameters and conditional functions of each pass in order.

In each compute pass, we work on the crntFrmula string, which holds the remaining unparsed portion of the HLD (or ‘formula’) of the current pass. The string KernelString is initialized to an empty string at the beginning of each pass and is used to accumulate what will become the final kernel code of the pass. We first extract the base function name by taking the substring from the opening bracket of the compute pass’s formula up to the first colon character ‘:’. The functionType enumeration is then determined through a switch statement, and crntFrmula is updated by removing the function name from its beginning.

The next step is to determine where the input to each compute pass is coming from. The implementation checks whether the remainder of the formula starts with a curly bracket. If so, one or more inputs are being specified; if not, the input to the compute pass is the original input. In this case there is only one original input surface for the whole compute (the input image to be filtered). The input enumerations of each pass are added to the kernelInputBuffers and secondaryInputBuffers arrays to be used by the kernel dispatcher. The input passes do not directly affect the generated kernel code here, except for adding a comment at the top of each kernel to help with debugging.

Now that we have the kernel code for the different passes, and know how they feed each other, we are ready to hand them to the kernel dispatch feeder, which loops through them to perform the final computation.


Ch2: Determining the Level of Compute Flexibility

Most of today’s applications that make use of general purpose graphics processing do so in the form of acceleration. That is, they were originally written for a traditional non-streaming processor, but have been given an extension where a graphics processor (if available) can be employed instead to accelerate performance. An example of this is Adobe’s Photoshop, which has an option to enable graphics acceleration for the processing it performs. What is different about the type of application we’re talking about in this blog is that it is written from the ground up assuming that a graphics compute device will be available and that it can execute JITed kernel code. Because of this, it is much more important to consider from the start how the application should be designed for this purpose. Essential to this is considering what exactly the user will be specifying, and how that determines how the compute needed to perform the required task must be crafted.

In the Photoshop example, for instance, the user can specify a type of blur filter to be applied to an image, usually with a diameter and weight for the blur function. The application then has a fixed, pre-implemented function that takes the specified diameter of the blur operation and applies it to every pixel of the image. A more user-engaged way of doing this, however, would be for the user to specify in more detail what exactly that blur operation is supposed to do to each pixel based on its neighboring pixels. For instance, the desired blur operation could take the average of a number of neighbors of each given pixel, with specific conditions and weights on distance, color and/or intensity values determining how they are included in the value that will replace the original pixel. So the user could specify in much more detail what exactly the blur operation should do and how.

For the purpose of this discussion, the blur operation happens to be a very good example because of the notion of separability. In image processing, a separable filter is one that can be written as the product of two simpler filters, one applied to the outcome of the other. Doing so can substantially reduce the total amount of compute and data fetching. The most commonly employed separable filter is the 2D Gaussian convolution, which is split into two linear (1D) convolutions, one applied on top of the other. Consider a Gaussian blur with a radius of 3. Performing such a filter without taking advantage of separability requires fetching the color values of 48 neighbors of each pixel, applying a weight to each and summing them up to determine the new pixel value. When implemented in separable form, only 12 neighbors need to be fetched: 6 horizontal neighbors in the first linear pass and then 6 vertical neighbors from the output of the first pass. This is possible because the 2D Gaussian kernel factors into the product of two 1D Gaussians. Such separability, however, is not numerically equivalent when implementing Gaussian-based anisotropic diffusion, where the weight of each neighboring pixel depends on how different its intensity is from that of the original pixel.

So, for an application that dynamically crafts its compute kernels, it’s important to consider the palette of operations being made available to the user at an early stage of the design. In the case of a simple image processing application with the ability to perform asymmetric blur operations, giving the user the capability to define per-pixel weights would be necessary. Then, when the user happens to be employing a symmetric filter, it could be left up to them to define how the separability is exploited. On the other hand, if the application is to perform only aggregates of symmetric blur filters, then hiding the separability aspect of each individual filter from the user can reduce user interface complexity.

Deep learning is another arena for general purpose graphics compute applications. In this field, giving an application the capability to craft its own compute kernels based on user specifications can provide a level of flexibility that enables the field to expand further and faster. It is, however, also another example of the importance of considering the high-level capabilities provided to the user early in the design of such applications. Not only is the space of neural network layers unlimited in the functionality of the layers, but so is the design space for how those layers are interconnected. Providing the user the flexibility to choose the definition of neural network layers and how they are interconnected is therefore undoubtedly desirable. But whether the user should be exposed to free-range flexibility is a design choice with tradeoffs that should be settled early in the design, if dynamic compute kernels are to be used. An example of excessive user options here might be a per-layer choice of activation function (the non-linearity that enables gradient descent to push the neural network to learn from training data). The tradeoff is between exposing the user to fewer low-level details, letting them focus on the high-level design choices, and providing the flexibility to explore new and unusual methods and techniques. For what we are looking at here, this design choice can heavily influence how the code that crafts the compute kernels should be designed.

The reason I’m stressing the importance of accounting for these high-level design choices early in the design of your self-crafting compute kernel application is that, as we’ll see in the following chapters, a central part of such applications as presented here is a form of high-level description language (HLDL). It is through crafting and parsing this HLDL that the final kernel string is generated and provided to the vendor graphics compiler to produce an actual executable shader. The HLDL essentially conveys the highest-level possible description of the computation that needs to be performed. It is in structuring this HLDL that it becomes important to have fully fleshed out the level and types of flexibility to be provided. Although the HLDL is typically internal to your application and can be modified transparently to the user, making even small modifications to it late in the game, when the HLDL is already employed across different APIs and platforms, can prove time consuming.

An approach I use for delineating the compute flexibility that will be needed in an application is to first think about which basic building-block functions are (or could be) important to provide to the user. I then break the functions down into three basic types: (a) single-input single-output functions, (b) two-input single-output merge functions and (c) condition functions. More complex (e.g. multi-input/output) functions may be the more common or efficient way of decomposing functionality in the field you’re developing an application for. Although using more complex basic building-block functions is possible, it will likely complicate the design of your HLDL, making it more difficult to expand in the future. Also note that this decomposition of the functions does not mean they will necessarily be executed in this form if that is not the efficient way of doing it. It is just the breakdown presented to the user, and the one on which the HLDL is based. In the next chapter we will see that, in compiling the final compute kernel, it’s the HLDL parser that is responsible for extracting, merging and combining the basic building blocks in the most efficient manner for execution. The idea is that in each compute pass we can perform a single-input single-output function on some input (from either another compute pass or external input), the function can be conditionally applied based on a condition function, and the result can then be merged with a secondary input (also from either another pass or from external input).
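To make this decomposition concrete, one compute pass under this model could be represented by something like the following structure; the type and field names are my own illustration, not Blurate’s internal types:

```csharp
// Illustrative representation of one compute pass under the three-type
// decomposition: a primary single-input single-output function, an optional
// condition function gating it, and an optional merge with a secondary input.
enum PrimaryFunction   { Convolution, Add, Subtract, Multiply }
enum MergeFunction     { None, Add, Multiply, Max }
enum ConditionFunction { None, WithinThresholdOfOriginal, WithinThresholdOfColor }

class ComputePass
{
    public PrimaryFunction Function;
    public float[] Parameters;            // e.g. convolution weights
    public ConditionFunction Condition;
    public float[] ConditionParameters;   // e.g. threshold value
    public int PrimaryInput;              // index of producing pass, or -1 for the original input
    public MergeFunction Merge;
    public int SecondaryInput;            // index of producing pass, or -1 if unused
}
```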

Once the basic functions are determined, I draw out a “flexibility diagram”. At the center of a piece of paper draw a circle labeled “Compute Pass”, and above that circle draw squares with arrows pointing to the center circle. Each of these top outer squares represents a single-input single-output function that can be performed in a compute pass on data provided as input or coming from other passes. Add as many squares as there are single-input single-output basic building-block functions you want to provide, and label each with a function name. In the top right corner of each function square, write down any arguments that function may require. For now, don’t worry about whether there should be restrictions on which functions can feed which other functions; those will be implemented as constraint checks within the interconnection logic between passes.

Depending on the application, there may be a need for the user to specify conditional factors in the computation. If that is the case, list all such conditionals in a legend somewhere on the page. These conditional functions can be as complex and nuanced as they need to be: on a per-element basis they could determine whether the function should be applied at all, vary internal parameters, or decide whether given inputs are to be discarded. Moreover, the condition can be on the input or output of the pass at hand, or on elements of the original input to the compute, so it is better to pin down their exact functionality right now. If not all conditionals are relevant to all base functions, then in the top left corner of each function square list its applicable conditionals.

Now, below the center circle, add triangles representing the merge functions that form two-input single-output passes. These functions determine how the outputs of different functions can be merged, and they are what enable us to decompose more complex multi-input multi-output functions into these smaller ones. Add a triangle for each of these functions, labeled with its name and with an arrow pointing away from the center circle. The main point here is simply to develop a solid understanding of the basic types of functionality needed in the application.

As an example, Figure 1 below shows the flexibility diagram used to represent the breakdown of the initial functions provided in the Blurate application.

Example flexibility graph, illustrating the primary, merge and conditional functions available to the user (from Blurate).

The most complicated primary function here is the conv (convolution) one; it requires specifying the kernel weights. Note that there is one conditional function listed. Put simply, it is the condition that the input to, or output from, the primary function be within a given threshold of either a fixed color or the value in the primary input it will be replacing. The condition can be on the absolute difference or on a percentage of the original pixel value. These conditionals are useful for creating anisotropic diffusion effects. There are some caveats, such as the fact that a conditional on the source of a non-convolution function being within a given threshold of the original pixel value will never fail. But for brevity in describing where the conditionals apply, we’ll assume they are applicable to all base functions.

Now that we have decomposed all the functionality, we are ready to start designing our HLDL, writing the code that generates kernel code for the computation described by that HLDL, and determining how the different portions of such a computation feed each other.

Ch1: General Purpose Compute

General purpose graphics compute frameworks such as OpenCL, Metal, Vulkan, RenderScript and CUDA consist of a host-side application that executes on a traditional processor (the host) and manages the dispatching of highly threaded tasks, known as kernels, to a processing device that can execute them. These processing devices are most commonly an integrated graphics processor or external graphics card (but can also be emulated by other forms of processing devices), and typically consist of numerous processing elements. The processing elements are commonly both SIMD (single-instruction multiple-data), i.e. multiple work threads share the same instruction counter, and multi-threaded, i.e. each maintains multiple instruction counters for different SIMD threads and interleaves their execution to better hide latency.

After performing some form of query to obtain a context or pointer to the compute device’s resources, the host-side application dispatches work to the compute device by providing it with three critical pieces of information: (1) the size and dimensionality of the work to be executed, (2) any input and output values and array addresses it may need, and crucially (3) the executable binary to be run on each processing element. Further information about how the work threads should be grouped and subgrouped can also be provided, or the details of the grouping can remain transparent to the host-side application. Work threads in the same group typically share common resources (e.g. caches and local memory) and are controlled by the same synchronization mechanisms (e.g. barriers). Work threads in the same subgroup are typically executed within the same SIMD thread.
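In OpenCL terms, for example, those three pieces of information map roughly onto the calls sketched below. The interface is a hypothetical stand-in for whatever OpenCL binding the host application uses; the underlying entry points named in the comments (clSetKernelArg, clEnqueueNDRangeKernel) are the real ones.

```csharp
// Hypothetical thin wrapper over the OpenCL host API; not a real binding.
interface IComputeQueue
{
    void SetKernelArg(object kernel, int index, object value);                     // clSetKernelArg
    void EnqueueNDRangeKernel(object kernel, ulong[] globalSize, ulong[] localSize); // clEnqueueNDRangeKernel
}

void DispatchExample(IComputeQueue queue, object kernel, object inputBuf, object outputBuf,
                     int width, int height)
{
    // (2) input and output buffers the kernel will read from and write to
    queue.SetKernelArg(kernel, 0, inputBuf);
    queue.SetKernelArg(kernel, 1, outputBuf);

    // (1) size and dimensionality of the work: one work item per output pixel
    var globalWorkSize = new ulong[] { (ulong)width, (ulong)height };

    // (3) the compiled binary is referenced through 'kernel';
    //     work-group size is left to the runtime here (null)
    queue.EnqueueNDRangeKernel(kernel, globalWorkSize, null);
}
```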

Upon receiving a dispatch for a given size and dimensionality of work, with given input/output values and array addresses and the starting memory address of an executable binary, the compute device dispatches threads to its processing elements to commence execution. The executed binary code of the work threads typically contains instructions that invoke a mechanism (e.g. a lookup from specific memory or I/O addresses) to access the thread’s own location within the total work dispatched. Because each thread executes the same binary code, this location information is crucial: it determines location-specific control flow conditions and where the thread should read from input arrays and write to output arrays. This is a very coarse description of how general purpose compute APIs function. For further details on specific APIs I would suggest searching for the latest books and blogs on the subject. What we are focusing on in this blog is the more general formation of the executable binary to be executed.

The executable binary is, in one way or another, generated by compiling generic code into a binary format that the compute device it will execute on can understand. This code is provided at run time either in human-readable string format or (as in the case of Google’s RenderScript) in an intermediate LLVM bitcode format, which reduces on-line compilation time. For readability, here we consider the code to be generated from human-readable string code, as even the intermediate byte-code originates from that at some point. The reason the executable code typically needs to be compiled on-line is to allow programmers to write and distribute their kernel code at a level that is independent of the varying details of the underlying architectures on which it will execute. In other words, on-line compilation of the kernel allows the programmer to write code that will run on widely different architectures, with different SIMD widths, different groupings of processing elements, different memory layouts and hierarchies and, most importantly, different instruction sets.

One note to make here is that compute APIs typically provide mechanisms to reuse compiled binaries. In other words, the binary code compiled for the compute device at hand does not need to be regenerated if it is going to be used multiple times. A simple example is video effects, where you would compile the kernel once and reuse it on every frame of a given video. This is typically achieved by caching the compiled binary for later reuse. Extending this binary caching mechanism, applications can also pre-cache compiled binaries for certain common compute devices. For instance, if a software product is designed for use on a certain platform with a specific compute device, bundling precompiled binaries for that device can do away with on-line compilation of the kernel code altogether. In this blog we are talking about how to craft the actual kernel code on-line in the host application, so regardless of whether the application targets a specific compute device or not, it needs to compile the kernel code on-line. But that doesn’t mean we can’t cache and reuse the compiled kernels across invocations where appropriate. In the chapter on the Kernel Dispatch Feeder we look at approaches to detect reusable compiled binaries within the portions of a compute task.

Now that we’ve covered some of the essential concepts in general compute, we can start to think about what an application that dynamically creates and executes its own kernel code would look like.



As modern-day graphics has advanced, numerous graphics vendors have attempted to design paradigms, languages and application programming interfaces (APIs) that enable programmers to describe graphics work flexibly and allow new visual experiences to be created – ones that were too cumbersome to achieve before. Across these different APIs and platforms, one thing that has uniformly emerged is that, in addition to the need for highly optimized, regular fixed-function capabilities, high flexibility is also essential. This has brought about the emergence of general purpose graphics processing in the form of what has become known as General Purpose Compute. As an example, describing the geometry and connections of vertices in 3D graphics, testing the visibility of the polygons those vertices form, and sampling from texture surfaces to fill in the visible polygons is highly regular work in common 3D graphics and better suited to rigid, dedicated fixed-function circuitry. But other work, such as emulating ambient occlusion to create more realistic light reflections, requires more flexibility, because the approaches to implementing such computation can vary widely depending on the type of atmosphere and effect the programmer is attempting to create. Therefore, all modern graphics APIs provide some form of general purpose graphics processing capability. The ubiquity of such general purpose processing capabilities in graphics platforms has brought about an entire field of general purpose Stream Processing that extends beyond graphics-related computation.

An intriguing aspect of such general purpose processing, which is inherited from graphics processing as a whole and forms the central theme of this blog series, is the Just-In-Time (JIT) compilation of its code. Because descriptions of graphics processing need to be portable across very different graphics processing devices, with completely different architectures and instruction sets (often even from the same vendor), there is a universal need for some sort of compilation of the processing code on the platform it will be executed on. Typically this capability is bundled within the driver of the graphics device, and the API provides an explicit call for compiling the kernel code. The binary produced from the compilation is then used to “dispatch” work on data provided by the programmer elsewhere in the application (or game). This is a very different processing landscape from the traditional pre-built binaries of the x86 era.

This aspect of general purpose graphics processing was not necessarily intended from inception, but it opens up a largely unexplored arena for applications that are both highly flexible and performant – and need not be graphics oriented at all. Most applications that use general purpose graphics processing capabilities currently use static, pre-coded kernels. The objective of this blog is to teach you how to write applications that are designed from the ground up with the dynamic crafting and dispatching of compute kernels at the center of their functionality.

This blog series will not teach you about any one general purpose graphics API, such as OpenCL, Metal, CUDA or Vulkan (there are plenty of books for that out there). What it will attempt to show you is how to develop a different breed of application that uses such APIs in a flexible, fast, user-expandable fashion, and is easily portable across the ever-expanding set of graphics APIs. It achieves this by defining an internal, API-agnostic high-level description language that is used to guide the crafting and dispatch of compute kernels targeted at the specific environment they are to run on. The required compute is performed by sequencing through the HLD and dispatching compute kernels such that they feed each other accordingly. And by manipulating the high-level description based on user needs, the application changes the actual code it executes.

For anyone already familiar with GPGPU programming, this may at first appear to be just a matter of string manipulation to craft kernel code. But as we will explore in the chapters to come, although crafting dynamic kernel code truly is only a matter of string manipulation, anything beyond simplistic kernel code transformations requires a well-structured design. Otherwise, as the dimensions of flexibility increase, the complexity of the kernel-crafting code can quickly become unmanageable. This blog series will introduce you to an approach and techniques for developing kernel-crafting software in a manner that is expandable and can be applied to any application that would benefit from dynamic compute kernels.

The techniques presented in this blog series were developed as part of an effort to use dynamic compute kernels in visual computing and deep learning for image processing applications at Advanced Kernels. Early in this endeavor it became clear that, in order to be expandable, portable and optimal, there was a need for structure in the design. In this blog I’ll describe our experience in this area and give you guidelines for avoiding the pitfalls we ran into.

Much of the experience expounded throughout the following posts, and the sample code and examples presented, come from a specific application we developed as part of this effort. The application enables highly customizable high dynamic range tone mapping and anisotropic diffusion by crafting compute kernels from a graphical user interface in which the user defines how basic building-block pixel manipulation functions feed each other. The application is called Blurate and can be downloaded from the Apple and Android app stores. It is a good example of how this framework can be used to create cross-API (in this case Metal and OpenCL) portable applications, and its code can be used as a baseline for developing your own cross-platform kernel-crafting applications.