Automatic Generation of Specialized Direct Convolutions for Mobile GPUs

Citation for published version:
https://doi.org/10.1145/3366428.3380771

Digital Object Identifier (DOI):
10.1145/3366428.3380771

Link:
Link to publication record in Edinburgh Research Explorer

Document Version:
Peer reviewed version

Published In:
GPGPU '20: Proceedings of the 13th Annual Workshop on General Purpose Processing using Graphics Processing Unit

General rights
Copyright for the publications made accessible via the Edinburgh Research Explorer is retained by the author(s) and / or other copyright owners and it is a condition of accessing these publications that users recognise and abide by the legal requirements associated with these rights.

Take down policy
The University of Edinburgh has made every reasonable effort to ensure that Edinburgh Research Explorer content complies with UK legislation. If you believe that the public display of this file breaches copyright please contact openaccess@ed.ac.uk providing details, and we will remove access to the work immediately and investigate your claim.
Automatic Generation of Specialized Direct Convolutions for Mobile GPUs

Naums Mogers  
University of Edinburgh  
Edinburgh, United Kingdom  
aums.mogers@ed.ac.uk

Valentin Radu  
University of Edinburgh  
Edinburgh, United Kingdom  
vradu@inf.ed.ac.uk

Lu Li  
University of Edinburgh  
Edinburgh, United Kingdom  
lu.li@ed.ac.uk

Jack Turner  
University of Edinburgh  
Edinburgh, United Kingdom  
jack.turner@ed.ac.uk

Michael O’Boyle  
University of Edinburgh  
Edinburgh, United Kingdom  
mob@inf.ed.ac.uk

Christophe Dubach  
University of Edinburgh  
Edinburgh, United Kingdom  
christophe.dubach@ed.ac.uk

Abstract
Convolutional Neural Networks (CNNs) are a powerful and versatile tool for performing computer vision tasks in both resource constrained settings and server-side applications. Most GPU hardware vendors provide highly tuned libraries for CNNs such as Nvidia's cuDNN or ARM Compute Library. Such libraries are the basis for higher-level, commonly-used, machine-learning frameworks such as PyTorch or Caffe, abstracting them away from vendor-specific implementation details. However, writing optimized parallel code for GPUs is far from trivial. This places a significant burden on hardware-specific library writers which have to continually play catch-up with rapid hardware and network evolution.

To reduce effort and reduce time to market, new approaches are needed based on automatic code generation, rather than manual implementation. This paper describes such an approach for direct convolutions using Lift, a new data-parallel intermediate language and compiler. Lift uses a high-level intermediate language to express algorithms which are then automatically optimized using a system of rewrite-rules. Direct convolution, as opposed to the matrix multiplication approach used commonly by machine-learning frameworks, uses an order of magnitude less memory, which is critical for mobile devices. Using Lift, we show that it is possible to generate automatically code that is \( \times 10 \) faster than the direct convolution while using \( \times 3.6 \) less space than the GEMM-based convolution of the very specialized ARM Compute Library on the latest generation of ARM Mali GPU.

1 Introduction
Convolutional neural networks [8] (CNN) dominate the field of computer vision and image processing. Due to the availability of parallel accelerators such as mobile GPUs, we are able to use CNNs to perform these complex tasks on resource constrained mobile devices. However, modern neural networks are computationally demanding, yielding large memory footprints and slow inference times, which has slowed their adoption in embedded settings.

CNNs typically have several convolution layers and one or more fully connected layers. Most of their execution time is spent in convolutions [15]. Convolutions slide several kernels across a multi-channel 2D image (e.g., the first input has typically three channels, RGB). The layers configurations vary significantly across networks and even among layers of the same network. For instance, in VGG architectures [19], the first convolutional layer operates on a 224×224 image with 3 channels while the 7th layer operates on a 112×112 image with 128 channels. The size and shape of convolutional kernels might also vary between networks or layers. This diversity in convolution input shapes represents a significant challenge for high-performance software engineers. In fact, obtaining good performance for rapidly evolving networks, hardware and workloads is a significant engineering challenge for library vendors relying on hand-coded solutions.

Most neural network libraries, such as Caffe [13] for CPU and CuDNN [5] for Nvidia GPU, solve this issue by expressing convolutions as General Matrix Multiplication (GEMM), since heavily optimized implementations are readily available. While this approach leads to high-performance, it significantly increases the required memory footprint, which can be a problem when running on mobile devices. For instance, a GEMM implementation of the 2nd convolutional layer of VGG requires 116 MB of memory for a single image while the direct convolution requires only 13 MB. If a large neural network processes multiple images (e.g., a video stream) at once, the device memory is quickly filled up.

Support for high performance direct convolution is not as common given that it is a specialized operation compared to the more generic GEMM. As a result, vendors typically do not invest as much effort in providing a tuned direct convolution implementation. As an example, the ARM Compute Library implementation of direct convolution only supports a handful of convolution shapes and is actually \( 10 \times \) slower than its GEMM counterpart on the ARM Mali GPU. This calls for an automatic approach that produces highly-specialized high-performance code for direct convolutions.
This paper presents an automatic code generation approach for direct convolution based on LIFT. LIFT expresses algorithms using a high-level data-parallel Intermediate Representation (IR). A system of rewrite rules optimizes LIFT expressions to specialize on the target architecture.

More specifically, this paper shows how CNN convolutions are expressed and optimized in LIFT. This is achieved by exploring a parametric space which includes tile sizes, amount of padding, amount of data reuse and the number of sequential operations performed by a thread. A series of constraints is automatically produced to restrict the search to valid combinations of tuning parameters (e.g., input size must be divisible by tile size). Using the latest generation of ARM Mali GPU, we demonstrate that LIFT generates high-performance direct convolution code that is on average ×10 faster than the ARM compute library direct convolution implementation, while using ×3.6 less space than GEMM-based convolution provided by the same library.

To summarize, the main contributions are:

- Show how we leverage LIFT to express the convolutional layers of neural networks;
- Evaluate a large optimization space of 1,000 points with LIFT;
- Produce code automatically for direct convolution that achieves a speedup of ×10 and memory saving of ×3.6 over the ARM’s own high performance library on the ARM Mali GPU.

2 Motivation

2.1 Convolutional Neural Networks (CNNs)

CNNs are the tool of choice for most computer vision problems. They are composed of stacked layers of convolutions over multi-channel inputs, where each layer produces a feature map per convolution kernel. In computer vision, the first image passed to a convolutional neural network has three channels, red, green and blue channels. They get transformed in scale and value based on the learned kernel weights at each layer.

For classification tasks, the output tensor flattens each feature map into a vector and passes it to one or more affine transforms. These affine transformations account for very little of the total inference time. For example, in SENet [11], the most recent ImageNet winner, convolution accounts for 99.99% of total floating point operations. Therefore, this paper focuses primarily on the convolution operation.

2.2 Direct Convolution

Each convolution kernel has a receptive field of spatial size \( kernel_{width} \times kernel_{height} \) in 2D, usually square, \( K \times K \), and a depth to match the input number of channels \( C \), across all \( M \) kernels. On an input image size \( C \times H \times W \) the direct convolution is performed with nested loops:

\[
\begin{align*}
\text{for } h & \text{ in } 1 \text{ to } H \\
& \text{for } w & \text{ in } 1 \text{ to } W \\
& \text{for } o & \text{ in } 1 \text{ to } M \\
& \text{for } a & \text{ in } 1 \text{ to } A \\
& \text{for } b & \text{ in } 1 \text{ to } B \\
& \text{for } c & \text{ in } 1 \text{ to } C \\
& \text{for } k & \text{ in } 1 \text{ to } K \\
& \sum & = 0; \\
& \text{sum} & += \text{input}[c][h+i][w+j]*\text{kernels}[o][i][j][c]; \\
& \text{output}[o][w][h] & = \text{sum}; \\
\end{align*}
\]

2.3 GEMM

The convolution operation is commonly implemented as matrix multiplication due to the availability of highly optimized GEMM routines available in libraries for both CPU (openBLAS) and GPU (CLBlas, cuDNN). This is achieved through the image to column (im2col) transformation, which unrolls each kernel into a row to form a matrix of all kernels, and each patch of image is mapped to a column to form another large matrix with a number of columns equal to the times each kernel should be convoluted over the image for the direct convolution approach. Matrices formed by each image channel are concatenated row-wise. The entire convolution operation is performed by executing one single dot product over these two large matrices using an efficient GEMM routine.

Figure 1 presents the im2col operation, where two \( 3 \times 3 \) kernels are convoluted on a single channel \( 5 \times 5 \) image. With direct convolution, the image has 25 elements and the two kernels have 9 elements each. To perform GEMM, kernels are unrolled into two rows, and through im2col the input is mapped to the input-patch matrix which is \( 9 \times \) larger than the original image. In total, this simple convolutional layer requires at least \( 9 \times \) more memory for the GEMM method than it would otherwise with the direct convolution.

2.4 Memory footprint

Figure 2 shows the actual run-time memory footprint required by the largest layer in the most popular deep neural networks. GEMM requires consistently more memory than direct convolution (one order of magnitude) due to increased memory size of the
transformed input, which can be a limitation when deploying on mobile and embedded devices.

3 Background: the LIFT System

The design goal of LIFT is to raise the programming abstraction and enable automatic performance optimizations on massively parallel accelerators, such as GPUs. LIFT provides a high level Intermediate Representation (IR) [20], and a compiler that automatically translates the high level IR to low level target code. The LIFT IR is functional where operations are side-effect free, enabling composition of LIFT primitives naturally. Optimizations choices are encoded using a system of rewrite rules that capture the algorithmic and hardware-specific optimizations.

3.1 LIFT Abstractions

LIFT includes hardware agnostic algorithmic primitives, and low-level primitives which encodes specific hardware details.

3.1.1 Algorithmic primitives

The main algorithmic primitives supported by LIFT and used in this paper are listed below. These algorithmic primitives only express what need to be computed, shielding programmers from any hardware-specific details.

- \( \text{map} : (f : T \rightarrow U, \text{in} : [T]_n) \rightarrow [U]_n \)
- \( \text{reduce} : (\text{init} : U, f : (U,T) \rightarrow U, \text{in} : [T]_n) \rightarrow [U]_1 \)
- \( \text{zip} : (\text{in1} : [T]_m, \text{in2} : [U]_n) \rightarrow [T \times U]_{m \times n} \)
- \( \text{split} : (m : \text{int}, \text{in} : [T]_n) \rightarrow [[T]_{m/n}]_m \)
- \( \text{join} : (\text{in} : [[T]_{m/n}]_m) \rightarrow [T]_{m \times n} \)
- \( \text{slide} : (\text{size} : \text{int}, \text{step} : \text{int}, \text{in} : [T]_n) \rightarrow [[T]_{\text{size} \times \text{step} \times \text{size} \times \text{step}}]_{n \times \text{size} \times \text{step}} \)
- \( \text{pad} : (l : \text{int}, r : \text{int}, \text{value} : T, \text{in} : [T]_n) \rightarrow [T]_{l+n+r} \)
- \( \text{transpose} : (\text{in} : [[T]_{m/n}]_m) \rightarrow [[T]_{n / m}]_n \)
- \( \text{reorder} : (f : \text{int} \rightarrow \text{int}, \text{in} : [T]_n) \rightarrow [T]_n \)
- \( \text{let} : (f : T \rightarrow U, \text{input} : T) \rightarrow U \)

3.1.2 Hardware-specific primitives

In order to support the generation of code for parallel accelerators, LIFT introduces low-level primitives that are tightly coupled with the hardware-specific programming model. We review briefly the main OpenCL primitives that are used in this paper to target a mobile GPU.

- \( \text{map\_wrg} \) and \( \text{map\_lcl} \) (Lift primitives only)

3.2 Example Stencil Program

This section reviews how LIFT expresses stencil computations [10], which forms the basis for convolutions. Listing 1 shows an example code using LIFT primitives to express a stencil computation. The function stencil2D takes a 3×3 weight array and a 2D image array.
The body consists of two parts: the data layout arrangement and the core computation. For data layout, it first creates a $3 \times 3$ sliding window using slide2D in line 10 which results in a 2D neighborhood. Then, two maps are used in line 4 to schedule the work to each local thread in each workgroup running on the GPU hardware.

Each thread performs the core computation part on a neighborhood, in lines 8 to 6. This process is visualized in fig. 3. First, two joins are used to flatten the 2D arrays: the weight array and the sliding window (i.e., neighborhood), into simpler 1D arrays. Then, the two 1D arrays are zipped into a single array of tuples, and reduced sequentially to a single scalar value which is the convolution output for each single image pixel position.

### 3.3 Code Generation Example

**Lift** produces parallel OpenCL code by walking the program IR tree and emits code for each primitive. The exception to this process are the primitives that are changing the data layout, such as join, split, zip, pad or slide. In these cases, the compiler builds an internal representation called a view [20], which captures the effects that these primitives have on data layout. When then, the data is accessed by other primitives, the compiler uses the information stored in the view to produce the right accesses to memory.

Listing 2 shows the code produced by the Lift compiler (with minor cosmetic changes such as naming and indentation) for the example in listing 1. First, a for loop for distributing the work among workgroup in the dimension 0 is generated on line 5 corresponding to the mapWrg. Then, a second loop for distributing the work among local threads is generated on line 7 corresponding to the mapLcl. The reduction accumulator is allocated in private memory and initialized on line 10. The reduction for loop follows, which accumulates the results of multiplying an element of the weight together with the corresponding element of the input data. Note that the array accesses are automatically generated using the information in the view built from the slide and zip primitives.

### 3.4 Optimization through Rewrites

Lift uses rewrite rules to encode optimization choices. This section briefly discusses two examples of such rewrites.

#### 3.4.1 Tiling

Tiling improves locality and enables work distribution to independent groups of threads. When tiling the input data of convolutions, care must be taken to ensure that the tiles overlap. To achieve this, tiling of convolutions is achieved by simply reusing the slide2D. This optimization is encoded using the following rewrite rule:

\[
\begin{align*}
\text{f} & (\text{slide2D}((\text{size1}, \text{size2}),(\text{step1}, \text{step2}), \text{input})) \\
\equiv & \quad \text{map2D}(f, \text{slide2D}((\text{ts1}, \text{ts2}), (\text{ts1-step1}, \text{ts2-step2})), \text{slide2D}((\text{size1}, \text{size2}),(\text{step1}, \text{step2}), \text{input}))
\end{align*}
\]

This rewrite matches a function $f$ applied to the results of a slide2D. The function $f$ could be performing a convolution as in the example from listing 1. In order to perform the tiling optimization, this rewrite replaces the matched expression by two level of nested slide2D and a map2D applied to $f$. The first slide2D at the bottom is the original one producing a 2D array of neighborhoods. The second one on top is the actual tiling of size $ts1 \times ts2$ which is performed by sliding the tile in 2D. The step is equal to the desired tile size minus the original step. This results in a 2D array of overlapping tiles containing 2D neighborhoods. The function $f$ is finally mapped in 2D over each tile.

#### 3.4.2 Vectorization

Vectorization is another example of an important optimization that highly benefits GPUs such as the ARM Mali GPU by using vector loads and stores and the built-in dot operator.

The following rewrite expresses this optimization:

\[
\begin{align*}
\text{map}(f, \text{input}) \\
\equiv & \quad \text{asScalar}(\text{map}(\text{vectorize}(f), (\text{asVector}(\text{input}))))
\end{align*}
\]

When a function $f$ is mapped, it is possible to vectorize the function with the Lift vectorize primitive. The asVector cast the input scalar array into vector type while asScalar does the reverse.

### 4 Direct Convolution in Lift

We now describe how a convolutional layer is expressed in Lift and introduces the low-level optimizations applied.

#### 4.1 High-level Lift Expression

Listing 3 shows the Lift expression of a convolution layer with three inputs. kernelsWeights contains the weights of all the kernels across the width, height and input channels. kernelsBiases are the biases, one per kernel. inputData contains the layer’s input which is a 3D array (width $\times$ height $\times$ input channels). padSize is a tuple of four values that specifies how much padding is required in each direction by the layer specification. kernelStride specifies by how much each kernel is displaced across the input (the step).

The output data is a set of feature maps represented as a 3D array with the outer dimension corresponding to the number of kernels. This Lift program in listing 3 consists of three steps. First, data is padded with zeros as per the configuration of the layer. Then, we slide in 2D across the padded input along the two spatial dimensions (inputWidth and inputHeight) producing the sliding windows. Finally, convolution is performed using a combination of Lift primitives. First, we map over each sliding window using
map2D on line 7. Then, kernelsWeights and kernelsBiases are zipped together on line 11 and mapped over on line 8. On line 9, we finally reduce over the flattened and zipped slidingWindow and singleKernelWeights. The zipping of the slidingWindow and singleKernelWeight ensures that the reduction operates on pair of corresponding elements from both arrays. The reduction operator multiplies the corresponding elements and adds to the accumulator which is initialized with a singleKernelBias.

#### 4.2 Low-Level Lift Expression

As shown in Listing 3, convolution is expressed as a set of reductions of sliding windows. However, in popular deep CNNs such as VGG, ResNet and GoogleNet, most convolutional layers are wide to such extent that the whole input does not fit in the cache (e.g., L2). We address this issue by tiling the input and splitting reduction in two steps. The first GPU kernel splits the input into tiles, each sliding window of each tile into chunks and reduces each chunk to a single value. This resulting vector of values per sliding window is reduced to one final value in the second GPU kernel.

To ensure that the tiles fit perfectly with the input sizes, extra padding might be required on the input using another GPU kernel before processing the data. Conversely, an extra GPU kernel might be required at the end to crop back the output. We discuss all four stages below.

#### 4.2.1 Padding

The padding expression has a dual purpose. First, it pads the input with zeros along all four edges as per the neural network architecture. Secondly, it zero-pads the input across the right and bottom edges so that the resulting array can be perfectly tiled. The amount of padding $p$ is determined automatically by a constraint solver and is explained later.

#### 4.2.2 Partial Convolution

Figure 4 presents an overview of the partial convolution algorithm. Acquiring input image and a set of convolutional kernels, we split the input into tiles and kernels – in kernel groups. Each combination of a tile and kernel group is processed by a single work group. Then, a window of the spatial size $kernelWidth \times kernelHeight$ is slid across the tile. This results in a set of sliding windows, which at this point are just virtual views into data.

Each sliding window is flattened across two spatial dimensions and input channels, and split into chunks. Each chunk is processed sequentially by a single thread. Each thread can process chunks from more than one sliding window. Each kernel is split into chunks accordingly; kernels are flattened across three dimensions and split. Each sliding window chunk is coupled with corresponding chunks in each of the kernels in the group. A thread processes each pairing of the input chunk with the kernels in a kernel group.

Processing each input-kernel chunk pair involves multiplying input values and corresponding weights, and summing the resulting vector. Thus, each sliding window is reduced to a vector of values, corresponding to each chunk in the sliding window. This is partial reduction; another expression further reduces the vector to each value resulting in a full convolution of each sliding window to a single output value.

Listing 4 shows our Lift algorithm. First, the input is tiled using slide2D and the 2D array of tiles is flattened (line 5). The tile size is controlled by the parameter $\theta$ and the stride is calculated to minimize the amount of tile overlap:

$$tilingStride = \theta - (kernelWidth \cdot kernelHeight - kernelStride)$$

We express convolution within each tile by nesting a second slide2D on line 6. This new five-dimensional view of the input data is further transformed using the inner expression on line 8. The 3D sliding window and convolutional kernels are represented as flat vectors; this simpler data layout enables coalescing of data.
def partialConv(kernelsWeights : [[[float]]], inputChannels : int, kernelWidth : int, kernelHeight : int, numKernels : int):
    paddedInput : [[[float]]], inputChannels : int, paddedWidth : int, paddedHeight : int
        kernelStride : (int, int))
    : [[[float]]], windowSize : int, windowStride : int, numKernels : int, nTilesInInput : int
    val tiledInput4D = join(slide2D(θ, tilinStride, paddedInput))
    val tiledSlidedInput5D = map(join(slide2D(kernelHeight, kernelWidth, kernelStride)), tiledInput4D)
    val windowSize = inputChannels * kernelWidth * kernelHeight
    def coalesceChunkVectorizeWindow(window : [[[float]]], windowStride : int, numKernels : int, nTilesInInput : int)
        : [[[float]]], windowSize : int, windowStride : int, numKernels : int, nTilesInInput : int
    val flatWindow1D = join(join(window))
    val flatCoalescedWindow1D = reorder(striddenIndex(windowSize/ω), flatWindow1D)
    val flatCoalescedChunkedWindow1D = split(ω, flatCoalescedWindow1D)
    asVector(ω, flatCoalescedChunkedWindow1D)
    val tiledSlidedCoalescedChunkedVectorizedInput4D = map(tile4D -> split(σ, map(window3D ->
        coalesceChunkVectorizeWindow(window : [[[float]]], windowSize : int, windowStride : int, numKernels : int, nTilesInInput : int)
    mapWrg([], inputTile3D ->
        mapWrg([], inputWinodws2D ->
            mapLcl([], (inputWindows2D, kernelsGroupWeights2D) ->
                mapSeq([], singleKernelReducedChunk ->
                    toGlobal(singleKernelReducedChunk),
                    reduceSeq(
                        init = mapSeq(toPrivate(id(Value(θ, [float]_)))),
                        f = (acc, inputsValuePrivate, kernelsGroupValue1D) ->
                            let(inputsValuePrivate ->
                                mapSeq([], singleKernelValue) ->
                                    mapSeq([], inputValuePrivate) ->
                                        accValue + vectorize(σ, dot(inputValuePrivate, singleKernelValue)),
                                        accValue + vectorize(σ, dot(inputValuePrivate, singleKernelValue)),
                                        zip(acc, kernelsGroupValue1D),
                                        mapSeq([], vectorize(σ, dot(inputValuePrivate, singleKernelValue))),
                                        zip(transpose([inputWindows2D, transpose(kernelsGroupWeights3D)]),
                                            inputTile3D),
                                            groupedCoalescedChunkedVectorizedKernelsWeights4D),
                                            tiledSlidedCoalescedChunkedVectorizedKernelsWeights4D))
        mapLcl([], inputWindows2D ->
            mapSeq([], singleKernelReducedChunk ->
                toGlobal(singleKernelReducedChunk),
                reduceSeq(
                    init = mapSeq(toPrivate(id(Value(θ, [float]_)))),
                    f = (acc, inputsValuePrivate, kernelsGroupValue1D) ->
                        let(inputsValuePrivate ->
                            mapSeq([], singleKernelValue) ->
                                mapSeq([], inputValuePrivate) ->
                                    accValue + vectorize(σ, dot(inputValuePrivate, singleKernelValue)),
                                    accValue + vectorize(σ, dot(inputValuePrivate, singleKernelValue)),
                                    zip(acc, kernelsGroupValue1D),
                                    mapSeq([], vectorize(σ, dot(inputValuePrivate, singleKernelValue))),
                                    zip(transpose([inputWindows2D, transpose(kernelsGroupWeights3D)]),
                                        inputTile3D),
                                        groupedCoalescedChunkedVectorizedKernelsWeights4D),
                                        tiledSlidedCoalescedChunkedVectorizedKernelsWeights4D))
    mapWrg([], inputWindows2D ->
        mapSeq([], singleKernelReducedChunk ->
            toGlobal(singleKernelReducedChunk),
            reduceSeq(
                init = mapSeq(toPrivate(id(Value(θ, [float]_)))),
                f = (acc, inputsValuePrivate, kernelsGroupValue1D) ->
                    let(inputsValuePrivate ->
                        mapSeq([], singleKernelValue) ->
                            mapSeq([], inputValuePrivate) ->
                                accValue + vectorize(σ, dot(inputValuePrivate, singleKernelValue)),
                                accValue + vectorize(σ, dot(inputValuePrivate, singleKernelValue)),
                                zip(acc, kernelsGroupValue1D),
                                mapSeq([], vectorize(σ, dot(inputValuePrivate, singleKernelValue))),
                                zip(transpose([inputWindows2D, transpose(kernelsGroupWeights3D)]),
                                    inputTile3D),
                                    groupedCoalescedChunkedVectorizedKernelsWeights4D),
                                    tiledSlidedCoalescedChunkedVectorizedKernelsWeights4D))

a single tile for a single kernel group; each thread reduces one or more sliding windows in one output channel.

4.2.4 Cropping
The final expression reverses the effect of the extra padding performed in the first expression. It crops the output using pad2D with negative values for padding sizes. The amount of horizontal and vertical cropping is calculated as:

\[ \text{cropSize} = \frac{\rho}{\text{kernelStride}} \]

The cropSize is guaranteed to be whole by the slide constraint discussed later in section 5.2.

5 Space Exploration
When exploring the search space of possible implementation, we leverage rich algorithmic information captured by the Lift IR. Type safety and provable correctness of rewrite rules allow to automatically explore structural code transformations that would otherwise require costly static analysis.

Lift supports symbolic parameter values into the types. Parameter tuning consists of finding valid combinations of tuning values, replacing them at the type level and generating a specialized implementation. This leads to GPU kernels that are specialized for the given input parameters and tuning values.

5.1 Tuning parameters and rewrite rules
Table 2 shows the tuning parameters.

<table>
<thead>
<tr>
<th>Symbol</th>
<th>Parameter</th>
</tr>
</thead>
<tbody>
<tr>
<td>( \theta )</td>
<td>Input tile size</td>
</tr>
<tr>
<td>( \rho )</td>
<td>Optimizational padding size</td>
</tr>
<tr>
<td>( \kappa )</td>
<td>Number of kernels per workgroup</td>
</tr>
<tr>
<td>( \sigma )</td>
<td>Number of sliding windows per thread</td>
</tr>
<tr>
<td>( \omega )</td>
<td>Sequentially processed input elements</td>
</tr>
<tr>
<td>( \upsilon )</td>
<td>Vector size</td>
</tr>
</tbody>
</table>

Table 2. Convolution expression tuning parameters

Unrolling During rewriting, Lift optionally unrolls the innermost reduction of the partial convolution. The compiler also removes the loops over work group or work item indices where the corresponding dimensions sizes are the same as the number of elements being processed.

5.2 Constraint inference
The expressiveness of Lift and the complex space produced by rewriting results in a high number of dependent and independent parameters which is hard to manual analysis. To address the problem of parameter validation, we used automatic constraint inference based on the information encoded in the IR and the type system. By traversing the AST, we collect variables from types and parameters, and infer continuous and discrete constraints on the parameter values. A constraint is expressed as a record specifying the condition that must hold true and the list of parameters the condition is imposed upon. We present examples of the constraints that are automatically derived from a Lift expression.

Algorithmic Algorithmic constraints are inferred based on the type of an IR primitive and the values of its parameters. Satisfying such constraints is required for producing semantically correct results. For the split primitive, the inferred constraint is as follows:

\[ \text{split} : (m : \text{int}, in : [T]_n) \Rightarrow n \% m = 0 \]

This constraint ensures that the split input is divisible evenly into chunks of \( m \) elements. The compiler traverses the arithmetic expression of the condition \( n \% m = 0 \) and collects all the parameters; they are marked as co-dependent.

asVector imposes a similar constraint to that of split:

\[ \text{asVector} : (m : \text{int}, in : [T]_n) \Rightarrow n \% m = 0 \]

slide comes in two conceptual flavours based on the constraints it imposes on the variables. The slideStrict requires that the sliding window covers perfectly the input:

\[ \text{slideStrict} : (\text{size} : \text{int}, \text{step} : \text{int}, in : [T]_n) \Rightarrow (n - \text{size}) \% \text{step} + 1 = 0 \]

slideStrict must be used for tiling, when the semantic correctness of the expression must be preserved for all parameter values. For kernel sliding, we use the normal slide since sliding is allowed to produce partial results; a notable example is the first layer of AlexNet [14].

Hardware The specifications of the target hardware impose the constraints on the maximum amount of threads in a single dimension, work group size, total memory allocated and maximum single
we use the ARM Compute Library (v19.02) with the Graph API,
A depends on the parameter B, B needs to be evaluated first. To
Table 3. All unique convolutional layer configurations of VGG-16
and the runtime [ms] evaluated for the ARM Compute Library
(Direct and GEMM) and the Lift-generated code for the HiKey 970
(Kirin 970 processor)
buffer size. These constraints can be inferred by calculating the min-
imum resources necessary to compute an expression and matching
them against respective OpenCL driver information.
5.3 Constraint solver
To explore the space of valid parameter value combinations for a
given layer configuration, we designed the following search strat-
egy. Firstly, we use the rewrite rule system to produce a parametric
candidate expression; this expression is traversed for parameter
and constraint inference. Next, we sort the parameters in the order
in which they need to be explored – for example, if the parameter
A depends on the parameter B, B needs to be evaluated first. To
find this ordering, we represent the collection of constraints as
a directed acyclic graph and sort it topologically. The resulting
partial sorting order is finalized by imposing a random order on
the unsorted groups of parameters. The derived parameter order is
used to incrementally generate random combination of parameter
values that satisfy all the constraints.
6 Experimental Methodology
Code generation The Lift compiler is used to generate the code
that runs on the GPU. We use an extended version of the Lift
compiler to also generate OpenCL host code that sets up the de-
vice, compiles the GPU code, sends/retrieves the data and executes
the GPU code. For each layer configuration, we generated 1000
randomly chosen implementations that satisfy all the constraints.
As a baseline to evaluate the performance of our generated code,
we use the ARM Compute Library (v19.02) with the Graph API,
implementing the same layers and running these on the GPU by
indicating c as the target from the API. All the ARM compute
library results are produced using ARM’s built-in auto-tuner.
Benchmarks To evaluate the code generated, we use all nine
unique layer configurations of the VGG-16 model [19]. This network
is well-studied performance in literature and has higher resource
requirements than others such as ResNet and GoogleNet [3]. Table 3
presents the layer configurations.
All results are validated by using a fixed random input and
comparing the output with that of PyTorch.
Platform In this paper, we target the ARM Mali-G72 (12 cores)
mobile GPU using the HiSilicon Kirin 970 SoC running Debian
GNU/Linux 9.8. The highest frequency (767MHz) was used.

<table>
<thead>
<tr>
<th>Layer</th>
<th>Input</th>
<th>Conv ARM ARM Lift</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>3x224x224</td>
<td>64x3x3</td>
</tr>
<tr>
<td>2</td>
<td>64x224x224</td>
<td>64x3x3</td>
</tr>
<tr>
<td>5</td>
<td>64x112x112</td>
<td>128x3x3</td>
</tr>
<tr>
<td>7</td>
<td>128x112x112</td>
<td>128x3x3</td>
</tr>
<tr>
<td>10</td>
<td>128x56x56</td>
<td>256x3x3</td>
</tr>
<tr>
<td>12 14</td>
<td>256x56x56</td>
<td>256x3x3</td>
</tr>
<tr>
<td>17</td>
<td>256x28x28</td>
<td>512x3x3</td>
</tr>
<tr>
<td>19 21</td>
<td>512x28x28</td>
<td>512x3x3</td>
</tr>
<tr>
<td>24 26 28</td>
<td>512x14x14</td>
<td>512x3x3</td>
</tr>
</tbody>
</table>

Table 3. All unique convolutional layer configurations of VGG-16
and the runtime [ms] evaluated for the ARM Compute Library
(Direct and GEMM) and the Lift-generated code for the HiKey 970
(Kirin 970 processor)

GPGPU ’20, February 23, 2020, San Diego, CA, USA
N. Mogers et al.

7 Evaluation
This section explores the performance of the automatically gen-
erated direct convolution in Lift. A comparison is given against
the best hand-written library for the ARM Mali GPU: the ARM
Compute Library.

7.1 Comparison with ARM Compute Library
Table 3 shows the execution times of the Lift-generated OpenCL
kernels and the ARM Compute Library direct convolution and
GEMM implementation. Both these versions have been auto-tuned
using the tools provided by the ARM Compute Library. As evident
from the results, the Lift-generated code is always faster than the
ARM Compute Library direct convolution and more space-efficient
than its GEMM method. Furthermore, in some cases it is actually
on par or better than the highly tuned GEMM implementation.
Figure 5 shows the performance of the Lift generated code
expressed as throughput – amount of useful outputs generated per
second – compared to that of direct and GEMM-based convolution
from the ARM Compute Library. For every layer, Lift is faster
than the ARM Compute library direct convolution and is ×10 faster
on average. While Lift kernels achieve only ×0.7 the throughput
of the GEMM-based implementation, the memory consumption
is ×3.6 less and is close to that of the vanilla direct convolution.
This demonstrates that our approach based on automatic code
generation outperforms a human expert.

7.2 Multi-objective optimization
Depending on application, priorities in neural network inference
optimization might shift. In a resource-bound system such as a
mobile GPU that is shared among multiple tasks, low memory
footprint is required; for time-critical tasks, throughput or latency are to
be prioritized. Figure 6 demonstrates how search space exploration
allows for multi-objective optimization to cater for various budgets:
advancing the Pareto frontier results in a set of implementation
candidates to choose from statically or at runtime for specific time
and space requirements. In the case of VGG layer 2, the compiler
might prioritize space efficiency by using 25 MBytes to compute
results in 100 ms; when the memory budget is bigger, the compiler
can prefer the 77 ms kernel that uses 31 Mbytes of space.

Populating a sizeable Pareto set is made possible thanks to the
exploration of the tuning parameter search space, performed in
a safe way thanks to constraint inference. Compared to libraries
that depend on sets of handwritten kernels, a compiler can adapt
to finer differences in the workload and target hardware.
Figure 5. Throughput and memory consumption comparison of Lift-generated kernels versus the direct and GEMM-based convolution methods on VGG-16

7.3 Analysis of the Best Point

We now analyze one of the best points that we found using the 7th layer of VGG as an example. Table 4 shows the best tuning parameters found together with the thread local sizes for the GPU kernel responsible for performing a partial convolution. These parameters show that a workgroup processes a tile which can fit $9 \times 5$ sliding windows. 4 out of 128 kernels are processed by a work group, enabling reusage of the input data multiple times, without adding too much register pressure; 3 out of 9 sliding windows are processed by each thread, enabling reusage of the weight data. The amount of padding is also quite minimal, which avoids unnecessary work. We also see that this point is vectorized which is good for memory loads on the Mali-G72 architecture.

8 Related Work

Several deep learning frameworks have recently been developed. Most of these frameworks rely on high-level graph-based representations of neural networks [1, 3, 13, 18] to allow for automatic differentiation. Such graphs are too high-level to be mapped optically to specific hardware, so frameworks rely on hand-written code provided by hardware vendors, as found in Intel’s MKL-DNN, Nvidia’s TensorRT and ARM’s Compute Library.

To address this, multi-level graph representations such as MXNet, XLA and TVM [3, 4, 17] have also been proposed, allowing subgraph and dataflow optimization to be made device-specific. TensorComprehensions [24] make use of the polyhedral compilation model to perform operation optimization and scheduling, but so far only target CUDA-capable GPUs. Depending on the target hardware, MXNet either provides handwritten layer implementations which lack portability or using BLAS libraries such as Atlas, MKL, CuBLAS and OpenBLAS. These libraries are also constrained in how much they can adapt to the target hardware relying just on tuning and handwritten code selection. Another code generator with auto-tuning is Latte [21], which has shown good performance for CPU code, although not evaluated on mobile devices. Their performance is achieved by generating code with cross-layer fusion, which is problematic for modeling exact layer conditions. On mobile platforms, MXNet only supports CPU-based libraries.

<table>
<thead>
<tr>
<th>Parameter</th>
<th>Value</th>
</tr>
</thead>
<tbody>
<tr>
<td>Input tile size</td>
<td>$5 \times 5$</td>
</tr>
<tr>
<td>Number of kernels per workgroup</td>
<td>4</td>
</tr>
<tr>
<td>Number of windows per thread</td>
<td>3</td>
</tr>
<tr>
<td>Sequentially processed input elements</td>
<td>144</td>
</tr>
<tr>
<td>Optimizational padding size</td>
<td>11</td>
</tr>
<tr>
<td>Vector size</td>
<td>4</td>
</tr>
<tr>
<td>Unrolling</td>
<td>No</td>
</tr>
<tr>
<td>Coalescing</td>
<td>Yes</td>
</tr>
</tbody>
</table>

Table 4. Best parameters found for layer 7 of VGG-16
Other works have recently explored efficient implementations of direct convolution [2, 9, 25] but are limited in the scope of their available target platforms. In particular, [9, 25] are reliant on the availability of SIMD instructions and are specific to CPUs. Tsai et al. [22] rely on efficient implementation of OpenCL kernels to reduce memory requirement of GEMM by avoiding replication of input patches, however this is not fast enough for mobile devices.

There have also been several developments at the algorithmic level allowing for fast approximations to convolution [16, 23], or computationally cheaper substitutions [6, 7, 12]. In this work we have not considered such approximate methods, but leave them for future exploration.

9 Conclusions

Most machine-learning frameworks rely on GEMM to implement convolutions due to the availability of high-performance implementations on most parallel devices. The downside is that GEMM requires an order of magnitude more memory than direct convolution, which can restrict the application of neural networks for memory limited embedded devices. Direct convolution is an attractive alternative, however, hardware-vendor provided implementations are often an order of magnitude slower than their GEMM counterpart.

This paper has shown how we automatically generate high performance direct convolution with Lift for the ARM Mali GPU. This approach leads to a $\times 10$ speedup and $\times 3.6$ memory saving over the tuned ARM Compute Library implementations.

Acknowledgments

This work was supported by the Engineering and Physical Sciences Research Council (grant EP/L01503X/1), EPSRC Centre for Doctoral Training in Pervasive Parallelism at the University of Edinburgh, School of Informatics.

References


