Ch4: The Kernel Dispatch Feeder

Now that we’ve generated the kernel code for each compute pass, the next step is to loop through the passes and dispatch them for execution with the appropriate input data buffers. The first pass has no option but to execute on the original input provided. In an application like Blurate, this input is an image, but in other applications it could be any other form of input data. The output of this first pass can be used as input to the passes that immediately follow it in the High Level Description of the computation. An example of such back-to-back feeding layers is the common morphological sequence of dilation followed by erosion, which is used to remove small artifacts from images. Alternatively, the passes following the initial pass can work on the same original input data, with their outputs merged in some later pass. An example of such a computation is a 2D Sobel filter for edge detection: horizontal and vertical differentiation passes are first applied to the same input surface, and their outputs are then merged with a vector addition operation.
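One way to picture these two feeding patterns is as a list of pass descriptions, where each pass names its feeder. The structure below is purely illustrative (it is not Blurate's HLD format); the field names are assumptions:

```python
# Illustrative pass-feeding structures (hypothetical format, not the
# actual HLD): each pass names the pass (or the original input) that
# feeds it, and merge passes name a secondary feeder as well.

# Back-to-back feeding: dilation followed by erosion.
morphology = [
    {"name": "dilate", "input": "source"},
    {"name": "erode",  "input": "dilate"},
]

# 2D Sobel: two passes read the same original input, then a merge pass
# combines their outputs with a vector addition.
sobel = [
    {"name": "dx",    "input": "source"},
    {"name": "dy",    "input": "source"},
    {"name": "merge", "input": "dx", "secondary": "dy"},
]
```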

What we’ll be looking at in this section is how to loop through the passes of a given HLD of some computation, dispatch their kernel code to the graphics device for execution, and feed each with the correct input data (wherever it might have come from). We’ll refer to the piece of logic that performs this work as the Dispatch Feeder. Code speaks a thousand words, so let’s start by looking at the abridged code for the Dispatch Feeder in the Blurate source code.

It first loops through the generated code of each of the passes, compiles them, and stores the executable binaries in an object called _specialkernel. This is done in the CreateKernels() function. Later, calling the ExecuteKernel() method with an argument specifying the index of a binary invokes the API calls to execute the kernel (in the case of Blurate, this consists of OpenCL clEnqueueNDRangeKernel invocations). Note that there are two versions of the ExecuteKernel() method: one for single-input kernels and the other for dual-input kernels (i.e. ones that have a merge function). There is also a clFinish call enforced between kernel invocations, which can be optimized out between kernels that are not dependent on each other.
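The structure just described can be sketched roughly as follows. This is an illustrative Python skeleton, not the actual Blurate source: the names CreateKernels, ExecuteKernel, and _specialkernel mirror the description above, while the compiler and runner callables are hypothetical stand-ins for the OpenCL build and clEnqueueNDRangeKernel calls:

```python
# Hedged sketch of a Dispatch Feeder (assumed structure): compile every
# generated pass up front, then dispatch compiled binaries by index.

class DispatchFeeder:
    def __init__(self, compiler, runner):
        self._compiler = compiler      # stand-in for the OpenCL build step
        self._runner = runner          # stand-in for clEnqueueNDRangeKernel
        self._specialkernel = []       # compiled binaries, one per pass

    def CreateKernels(self, pass_sources):
        """Compile the generated code of every pass and store the binaries."""
        self._specialkernel = [self._compiler(src) for src in pass_sources]

    def ExecuteKernel(self, index, input_buf, output_buf):
        """Single-input variant: dispatch one compiled pass."""
        self._runner(self._specialkernel[index], [input_buf], output_buf)

    def ExecuteKernel2(self, index, input_buf, secondary_buf, output_buf):
        """Dual-input variant, for passes that have a merge function."""
        self._runner(self._specialkernel[index],
                     [input_buf, secondary_buf], output_buf)
```

A real implementation would also issue the clFinish between invocations; that is omitted here for brevity.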

A caveat to consider in this process is that, for performance, we may want to cache the compiled passes to be reused in later iterations of the computation. Certain compute operations may also internally contain multiple iterations of the same compute pass with different input data. Caching compiled compute binaries requires looking up already-compiled kernel code based on a unique signature of each pass (such as a string hash of the kernel code, including any JIT parameters defined). Since compile time can be significant in certain environments, caching is worth pursuing whenever performance is impacted by compilation. It also brings up a performance tradeoff between baking in JIT parameters, which requires separate builds for alternate JIT parameter values, versus passing constant values as arguments to the kernel, which introduces its own overheads. Another thing to keep in mind is that some graphics compute APIs have built-in mechanisms for identifying and reusing recently compiled code, removing that burden from the Dispatch Feeder.
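A signature-keyed cache of this kind could look like the following sketch (an assumed design, not Blurate's actual code; the compiler callable stands in for the expensive compile step):

```python
# Illustrative kernel-binary cache: compiled passes are keyed by a hash
# of the kernel source plus any baked-in JIT parameters, so an identical
# pass seen in a later iteration reuses the existing binary.

import hashlib

class KernelCache:
    def __init__(self, compiler):
        self._compiler = compiler   # expensive compile step to avoid repeating
        self._cache = {}
        self.hits = 0

    def get(self, source, jit_params):
        # The signature must cover the JIT parameters too: the same source
        # built with different baked-in constants is a different binary.
        sig = hashlib.sha256(
            (source + "|" + repr(sorted(jit_params.items()))).encode()
        ).hexdigest()
        if sig not in self._cache:
            self._cache[sig] = self._compiler(source, jit_params)
        else:
            self.hits += 1
        return self._cache[sig]
```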

Another time-consuming task is the memory allocation needed to store the intermediate output data of the internal compute passes. This can be most noticeable on Android platforms, where memory allocation happens through the Java runtime. Therefore it is important to be able to reuse allocated buffers beyond their initial live scope (i.e. the point after which no following compute passes need the data stored in them). For this reason, an important part of the kernel Dispatch Feeder involves identifying and reusing allocated intermediate buffers that are out of scope. In the code above you can see that, while sequencing through the kernels, there is a forward scan to find any allocated buffers that aren’t going to be used again. These can quickly be identified from the kernelInputBuffers and secondaryInputBuffers arrays we filled in when parsing the HLD.
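The forward scan can be sketched as below. The array names kernelInputBuffers and secondaryInputBuffers come from the text above; the function shape and everything else is an assumption:

```python
# Sketch of the dead-buffer forward scan: before allocating an output
# buffer for pass `current`, look for an already-allocated buffer that no
# pass from `current` onward reads, and reuse it if found.

def find_dead_buffer(current, allocated,
                     kernel_input_buffers, secondary_input_buffers):
    """Return a reusable buffer id, or None if a fresh allocation is needed."""
    for buf in allocated:
        used_later = any(
            kernel_input_buffers[p] == buf or secondary_input_buffers[p] == buf
            for p in range(current, len(kernel_input_buffers))
        )
        if not used_later:
            return buf   # out of scope: safe to reuse as this pass's output
    return None
```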

If no dead buffer is found, the current pass needs to allocate a new buffer to store its output in. To preserve only those output surfaces that are alive (i.e. will be used in later passes), we scan through the remainder of the passes and discard the buffers that are dead. Besides the performance benefit of reusing dead buffers, for applications that deal with large surface data (as is the case with Blurate) this can also be necessary to avoid out-of-memory situations.

An important note here concerns the dimensionality of the input and output buffers. In Blurate, since all modifications happen on the input surface (either the whole surface or a specific portion of it), all intermediate surface buffers retain the original input size. Therefore there is no need to explicitly identify the size of each intermediate buffer. In applications where the dimensionality of the surfaces may change between compute passes, however, the HLDL would need the output dimensionality to be specified somewhere in the description of a compute pass, either as a fixed value or as a ratio of the input dimensions. The best example of such a compute pass in neural networks is a max-pooling function, where the output surface typically has half the width and height of the input surface (each output value is the “max” of 4 spatially neighboring values from the input surface). In these cases, the amount of buffer space needed for the new surface also has to be accounted for. There is a tradeoff between the performance impact of releasing and reallocating a smaller surface, as opposed to keeping the larger surface allocated and reusing only the needed portion of it. If, alternatively, the newly required output surface is larger than the surface being freed, then the buffer must either be resized or freed and reallocated. Also, if multiple buffers become free at the same time, there may be an opportunity to search for the largest one, or to combine them to form the new output buffer.
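A per-pass output-size rule of this kind might look like the following (a hypothetical HLDL extension, not something Blurate needs, since all its surfaces stay at the input size):

```python
# Hypothetical output-dimension rule for a compute pass: the pass
# description supplies either fixed output dimensions or a ratio of the
# input dimensions; with neither present, the output matches the input.

def output_dims(input_w, input_h, pass_desc):
    if "fixed_dims" in pass_desc:
        return pass_desc["fixed_dims"]
    rw, rh = pass_desc.get("dim_ratio", (1.0, 1.0))
    return (int(input_w * rw), int(input_h * rh))
```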

A further aspect to consider here is surface reusability within a single compute pass, i.e. passes that write to their input buffers. Such passes would potentially not require an output buffer to be allocated for them. Most compute APIs do allow buffers and surfaces to be defined as readable, writable, or both. In practice, reading from and writing to the same surface in the same compute pass is either not allowed or not guaranteed to execute in any particular order, unless some form of synchronization barrier is employed. If barriers were used for certain compute functions, allowing them to write to their input buffers, this would need to be accounted for in the liveness detection code. Specifically, it would need to identify that certain compute passes do not need a buffer allocation for their output, and ownership of the overwritten buffer would need to be transferred to the new pass, so it is not incorrectly identified as a dead buffer later on. In Blurate we assume there is no kernel code that employs barriers, which simplifies things for the Dispatch Feeder.
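The bookkeeping for such an in-place pass might be sketched as follows (hypothetical, since Blurate assumes no such kernels; the function and parameter names are invented for illustration):

```python
# Sketch of liveness bookkeeping for in-place passes: a pass that
# overwrites its input needs no new output buffer, but ownership of that
# buffer must move to the pass so the dead-buffer scan does not recycle
# it prematurely.

def plan_output(pass_idx, in_place, pass_input, buffer_owner, allocate):
    if in_place:
        buf = pass_input[pass_idx]      # write back into the input buffer
        buffer_owner[buf] = pass_idx    # re-own it: still live after this pass
        return buf
    buf = allocate()                    # ordinary "in-out" pass: new output
    buffer_owner[buf] = pass_idx
    return buf
```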

In general, the main problem with barriers is that, if not used with care, they can cause bizarre performance artifacts across different compute devices due to fragmentation of resources, depending on how a graphics architecture partitions its resources to exploit locality. Therefore it is preferable to use “in-out” compute passes (ones that do not overwrite input buffers) and avoid barriers as much as possible. However, some tasks can benefit substantially from fine-grained serialization between writing to and reading from data buffers. Examples are matrix multiply and FFT operations, where storing highly reusable portions of data to be shared among work threads can notably improve memory efficiency and reduce stalls on memory access. Adding barrier and shared-memory support for such functions to your dynamically generated compute kernels can be done on a per-function basis and can, for the most part, be transparent to the HLDL. However, the one place where you need visibility into whether buffers are overwritten in a pass is here in the Dispatch Feeder. It needs to know which surfaces are read from and overwritten in the same compute pass, so that it can account for this when determining which buffers are alive and can’t be recycled in later passes.

Author: Ash

Studied computer architecture at NCSU. Just having fun with JITed compute graphics.