High-Level Hardware Feature Extraction for GPU Performance Prediction of Stencils

Citation for published version:

Digital Object Identifier (DOI):
10.1145/3366428.3380769

Link:
Link to publication record in Edinburgh Research Explorer

Document Version:
Peer reviewed version

Published In:
GPGPU '20: Proceedings of the 13th Annual Workshop on General Purpose Processing using Graphics Processing Unit

General rights
Copyright for the publications made accessible via the Edinburgh Research Explorer is retained by the author(s) and / or other copyright owners and it is a condition of accessing these publications that users recognise and abide by the legal requirements associated with these rights.

Take down policy
The University of Edinburgh has made every reasonable effort to ensure that Edinburgh Research Explorer content complies with UK legislation. If you believe that the public display of this file breaches copyright please contact openaccess@ed.ac.uk providing details, and we will remove access to the work immediately and investigate your claim.
High-Level Hardware Feature Extraction for GPU Performance Prediction of Stencils

Toomas Remmelg  
The University of Edinburgh  
Edinburgh, Scotland, United Kingdom  
toomas.remmelg@ed.ac.uk

Bastian Hagedorn  
University of Münster  
Münster, Germany  
b.hagedorn@wwu.de

Lu Li  
The University of Edinburgh  
Edinburgh, Scotland, United Kingdom  
lu.li@ed.ac.uk

Michel Steuwer  
University of Glasgow  
Glasgow, Scotland, United Kingdom  
michel.steuwer@glasgow.ac.uk

Sergei Gorlatch  
University of Münster  
Münster, Germany  
gorlatch@wwu.de

Christophe Dubach  
The University of Edinburgh  
Edinburgh, Scotland, United Kingdom  
christophe.dubach@ed.ac.uk

Abstract

High-level functional programming abstractions have started to show promising results for HPC (High-Performance Computing). Approaches such as Lift, Futhark or Delite have shown that it is possible to have both, high-level abstractions and performance, even for HPC workloads such as stencils. In addition, these high-level functional abstractions can also be used to represent programs and their optimized variants, within the compiler itself. However, such high-level approaches rely heavily on the compiler to optimize programs which is notoriously hard when targeting GPUs.

Compilers either use hand-crafted heuristics to direct the optimizations or iterative compilation to search the optimization space. The first approach has fast compile times, however, it is not performance-portable across different devices and requires a lot of human effort to build the heuristics. Iterative compilation, on the other hand, has the ability to search the optimization space automatically and adapts to different devices. However, this process is often very time-consuming as thousands of variants have to be evaluated. Performance models based on statistical techniques have been proposed to speedup the optimization space exploration. However, they rely on low-level hardware features, in the form of performance counters or low-level static code features.

Using the Lift framework, this paper demonstrates how low-level, GPU-specific features are extractable directly from a high-level functional representation. The Lift IR (Intermediate Representation) is in fact a very suitable choice since all optimization choices are exposed at the IR level. This relies on the use of the Lift type system together with the Lift rewrite rules attempt to define high-level semantic information. It also shows how cache locality of low-level information useful for predicting performance using performance counters or low-level static code features.

Using the extracted features, a performance predictor is built automatically and adapts to different devices. However, this process is often very time-consuming as thousands of variants have to be evaluated. Performance models based on statistical techniques have been proposed to speedup the optimization space exploration. However, they rely on low-level hardware features, in the form of performance counters or low-level static code features.

Using the Lift framework, this paper demonstrates how low-level, GPU-specific features are extractable directly from a high-level functional representation. The Lift IR (Intermediate Representation) is in fact a very suitable choice since all optimization choices are exposed at the IR level. This paper shows how to extract low-level features such as number of unique cache lines accessed per warp, which is crucial for building accurate GPU performance models. Using this approach, we are able to speedup the exploration of the space by a factor 2000x on an AMD GPU and 450x on Nvidia on average across many stencil applications.

Keywords  Performance models, GPUs optimizations, Stencil computation, Features extraction

1 Introduction

Recent years have witnessed the emergence of high-level approaches for high-performance computing such as Accelerate [22], Futhark [10], Delite [4], Lift [35] and AnyDSL [19]. They enable programmers to write hardware-agnostic code while putting the burden on the compiler to extract performance. Tuning a compiler is very laborious and time-consuming, especially when considering accelerators such as GPUs (Graphics Processing Units) and this process has to be repeated for every new hardware generation.

Lift proposes to use rewriting [34] to solve this problem. Rewriting for compiler optimizations is an approach first proposed in 2001 in the Haskell compiler [30]. Lift’s rewrite rules attempt to define the set of all possible algorithmic and, crucially, hardware-specific optimizations. Rewrite rules liberate compiler writers from having to implement hard-coded optimizations and make it easy to extend the compiler. Optimizations are simply implemented as rules and a generic rewriting engine explores the space automatically.

However, this approach results in a large optimization space. The optimization process takes a few hours for stencils on GPUs [9], even when using an efficient auto-tuner [1]. In response, this paper develops an automatic performance model predicting the best optimized program variant using static features from the high-level Lift IR. This removes the necessity for compiling and running programs which accounts for the majority of the exploration time.

The use of performance modeling for GPUs is not novel [11, 12, 26, 27, 38]. However, to the best of our knowledge, this is the first paper to show how information about low-level GPU-specific features is extractable from a high-level functional IR. This paper demonstrates that a high-level IR is amenable to the extraction of low-level information useful for predicting performance using high-level semantic information. It also shows how cache locality information is extractable at this level. This relies on the use of the rich information stored in the Lift type system together with the ability to reason about array indices in a symbolic manner.

Using the extracted features, a performance predictor is built using machine-learning. This leads to a highly accurate model for the stencil domain, an important class of high-performance code. The model achieves a correlation of 0.8 and 0.9 on GPUs from Nvidia and AMD, respectively. Using the model to search the space requires less than 5 runs in the majority of the cases to achieve performance within 90% of the best available. In comparison, a random search requires 100s of runs in the majority of the cases.

To summarize, the paper makes three contributions:
Figure 1. LIFT compilation and exploration. a) The current approach compiles and executes all transformed expressions. b) The new strategy ranks the transformed expressions with a model and only compile and execute the best ones.

- It shows how low-level GPU hardware features are extracted from a high-level functional IR;
- It presents a simple unsupervised learning approach using PCA and Clustering that predicts program performance;
- It shows how low-level hardware features are extracted from the high-level functional IR while Section 4 will discuss feature extraction.

The rest of this paper is organized as follows: Section 2 motivates this work while Section 3 presents background information about OpenCL and LIFT. Section 4 explains how low-level hardware features are extracted from the high-level LIFT IR, and Section 5 presents the performance model. Section 6 analyses the features and the model performance while Section 7 shows that the model is able to speed up drastically the optimization space exploration. Finally, Section 8 discusses related work and Section 9 concludes.

2 Motivation

Current LIFT Exploration  LIFT [34] explores the GPU optimization space using rewrite rules. Figure 1a presents an overview. First, a high-level expression representing the program is used as an input to the compiler. This generic high-level expression does not encode any optimizations. Then, the rewriting takes place and the LIFT exploration module applies rewrite rules to search the space randomly. This results in a set of transformed expressions where optimizations have been applied and parallelism has been mapped.

The transformed expressions are then fed into the LIFT code generator which produces OpenCL kernels. These kernels are compiled with the vendor-provided OpenCL compiler into binaries. Finally, all binaries are executed, the performance is recorded and the best found kernel is reported.

This process is time consuming as it produces a large number of kernels (1,000 for this paper). In addition, every LIFT generated kernel is executable with a different number of threads leading up to 10,000 kernel executions.

Time breakdown  Figure 2 shows the percentages of the time spent in the different stages of the current LIFT compilation and exploration. Unsurprisingly, the last part of LIFT’s workflow, the kernel execution, requires by far the most time (up to 90%). For this paper, executing all kernels for a single application, including the exploration of thread configurations took up to 41 minutes while all kernels were generated in less than a minute, which is about 2% of the overall time.

Using a Performance Predictor for Exploration  The major bottleneck for exploration is clearly the OpenCL compilation and execution time of the generated kernels, which represent 98% of total time. This paper addresses this bottleneck by using a trained performance predictor directly on the transformed LIFT expression. Figure 1b shows how the exploration strategy is modified with a performance model.

Once the transformed expressions have been produced, the idea is to extract features that are informative about performance. These features are fed into a predictive model which almost instantaneously ranks the transformed expressions. Then, the transformed expression with the fastest predicted performance is selected, the corresponding kernel generated, compiled and finally executed.

While this approach seems very simple, the challenges are twofold. First, we need to identify features that are informative about performance, such as memory access patterns. Then, they need to be extracted from the high-level functional LIFT IR. As we will see, the LIFT IR encodes all the required information to calculate low-level GPU-specific features. The next section gives background information about the LIFT IR while Section 4 will discuss feature extraction.

3 Background

This section introduces OpenCL, the existing LIFT IR, and the rewrite systems used to produce efficient OpenCL kernels.

3.1 OpenCL

An OpenCL program (kernel) is executed by multiple threads (work-items) organized in work-groups, providing a two-level thread hierarchy. Both work-items and work-groups are organized in three-dimensional grids identified by unique IDs. On GPUs, multiple
work-groups are executable on a core, and work-items are scheduled in groups of 32 for Nvidia (warp) or 64 for AMD (wavefront). It is generally desirable to start a large thread number to reach maximum occupancy.

OpenCL provides a three-level memory hierarchy: Global memory is accessible by all work-items and throughput is maximized when threads in the same warp/wavefront access the same cache line (coalesced accesses). Work-items of the same group communicate via a fast shared local memory, and each work-item has its own private memory.

3.2 Lift IR
Lift [34, 35] is a functional language based on lambda calculus, offering a small set of reusable primitives. It is a compiler-internal data-parallel intermediate language and is compiled to high-performance OpenCL code. Lift’s distinguishes algorithmic primitives which express what to compute, from OpenCL-specific primitives which express how to compute by explicitly mapping computations to the OpenCL programming model. Lift’s type system supports scalar types (e.g. int, float), tuple-types (denoted as $U \times T$), and array-types (denoted as $[T]^n$), where the array size $n$ is part of the type.

**Algorithmic Primitives** Lift provides well-known functional primitives defined on arrays as listed below:

- $map: (f : T \rightarrow U, \text{in} : [T]^n) \rightarrow [U]^n$
- $reduce : (\text{init} : U, f : (U,T) \rightarrow U, \text{in} : [T]^n) \rightarrow [U]$
- $zip : (\text{init} : [T]^n, \text{init2} : [U]^n) \rightarrow [T \times U]^n$
- $iterate : (\text{in} : [T]^n, f : [T]^n \rightarrow [T]^n, m : \text{int}) \rightarrow [T]^n$
- $split : (m : \text{int}, \text{in} : [T]^n) \rightarrow ([T]^m)^{n/m}$
- $join : (\text{in} : ([T]^m)^n) \rightarrow [T]^{m \times n}$
- $slide : (\text{size} : \text{int}, \text{step} : \text{int}, \text{in} : [T]^n) \rightarrow ([T]_{\text{size}})^{n-\text{size} \div \text{step}}$
- $pad : (l : \text{int}, r : \text{int}, h : (i : \text{int}, \text{len} : \text{int}) \rightarrow \text{int}, \text{in} : [T]^n) \rightarrow [T]_{l+n+r}$
- $at : (i : \text{Cst}, \text{in} : [T]^n) \rightarrow T$
- $get : (i : \text{Cst}, \text{in} : [T]^n) \rightarrow T$
- $array : (n : \text{int}, f : (i : \text{int}, n : \text{int}) \rightarrow T) \rightarrow [T]^n$
- $userFun : (s1 : \text{Scalar}T, s2 : \text{Scalar}T', \ldots) \rightarrow \text{Scalar}U$

Lift supports the definition of arbitrary scalar-based sequential OpenCL-C functions called $userFun$. These are directly embedded in the generated OpenCL code.

**OpenCL-specific primitives** Lift’s OpenCL specific primitives expose OpenCL’s thread and memory hierarchy. These primitives are used to explicitly dictate how to perform the computation expressed with the algorithmic primitives.

Parallelism is exposed via specialized variations of $map$, $mapGlobal$, $mapWorkgroup$, $mapLocal$, and $mapSeq$. These primitives directly correspond to OpenCL’s thread hierarchy. The computation specified within a OpenCL-specific $map$ is performed by its particular level and dimension $d \in \{0, 1, 2\}$ of the thread hierarchy, or executed sequentially by a single thread ($mapSeq$). OpenCL’s memory hierarchy is exposed via $toGlobal(f)$, $toLocal(f)$ and $toPrivate(f)$, which specify where the output of the function $f$ is stored in memory.

3.3 Rewriting
Lift encodes optimizations as semantics-preserving rewrite rules. These rules are used to transform a high-level expression written using the algorithmic primitives into a transformed expression in which parallelism and memory is explicitly exploited. Similar to Lift’s primitives, rewrite rules are also categorized into algorithmic or OpenCL-specific rules. Algorithmic rules such as the divide-and-conquer rule:

$$map(f) \rightarrow \text{join} \circ \text{map}(\text{map}(f)) \circ \text{split}(n)$$

create a space of possible algorithmic implementations for the same expression. OpenCL-specific rules such as:

$$map(f) \rightarrow \text{mapGlobal}(f)$$

map expressions to the OpenCL’s programming model.

3.4 Example
Listing 1 shows a 1D 3-point stencil expressed in Lift [9]. $pad$ is applied adding one element (0) to the left and right of the input array arg to implement a simple boundary handling. $slide$ creates overlapping neighborhoods of three elements which are summed up using $map$ and $reduce$.

Applying rewrite rules leads to Listing 2, where overlapped tiling has been applied. Every tile is processed by a work-group ($mapWrg$) loading all elements to local memory and computing the output using its work-items before storing it in global memory. From this expression high-performance OpenCL code is generated as shown in [9].

4 Feature Extraction
This paper proposes a performance model that predicts the performance of transformed Lift expressions on GPUs in order to identify the best variant. The model relies on static features extracted from the high-level Lift IR. Although the features are extracted at a high-level, they capture information about low-level hardware features. They broadly fall into three categories as seen in Table 1.

4.1 Parallelism
For a fixed input size, the number of launched threads influences how much parallelism versus sequential work is performed. We include both global and local thread counts across the three thread dimensions as features. Local thread count affects how large each work-group will be, which may affect data reuse or the number of concurrent groups.
4.2 Memory

This section covers the features related to the amount of memory allocated, number of accesses, and access patterns.

4.2.1 Local memory usage

One of the important factors that determines performance on a GPU is occupancy. Occupancy is typically maximized when multiple work-group execute concurrently. More concurrent work-groups typically translates to more threads executing concurrently, which ultimately helps hiding memory latency.

The number of work-groups that execute simultaneously on a core depends on the amount of resources used by each work-group. One important resource is the amount of fast local memory (shared memory) used by the work-group. Therefore, it is crucial to determine this quantity.

Extracting the amount of local memory used in a Lift program is straightforward. The program is traversed once, collecting memory allocation sizes and summing up these numbers.

4.2.2 Number of Memory Accesses

Performance is largely affected by the amount and type of memory operations. Applications that exhibit large amount of data re-usaged will benefit from exploiting the fast local memory. The program can simply reuse the data in local memory several times, reducing the number of global memory accesses, resulting in increased performance.

Algorithm The Lift code generator only produces loads and stores to memory when a user function is called. Therefore, counting the number of loads and stores boils down to counting how often each user function is called. As can be seen in Algorithm 1, a depth-first traversal is performed on the IR while keeping track of the number of times the body of patterns generating loops is executed. Once a user-function is reached, the feature extractor simply updates the total number of loads and stores. In addition to this, the extractor keeps track of the type of memory being accessed, local or global, using the toLocal and toGlobal patterns. The information about the address space is encoded directly into the IR and is populated by another pass that runs prior to feature extraction. The number of global/local loads and stores is then normalized by the number of total threads.

Algorithm 1: Pseudo-code for counting the total number of loads/stores for each type of memory.

```java
input :Lambda expression representing a program
output :Numbers of different types of memory accesses.
countAccesses(lambda) 1 totalLoad[local] = 0; totalLoad[global] = 0 2 totalStore[local] = 0; totalStore[global] = 0 3 countAccessesExpr(lambda.body, 1) 4 return [totalLoad, totalStore] 5 countAccessesExpr(expr, iterationCount) 6 switch expr do 7 case fc@FunCall 8 foreach arg in fc.args do 9 countAccessesExpr(arg, iterationCount) 10 case expr.f do 11 switch expr.f do 12 case it@Iterate 13 countAccessesExpr(it.body, iterationCount * it.count) 14 case it@MapSeq 15 countAccessesExpr(m.body, iterationCount * m.count) 16 case if@If 17 countAccessesExpr(if.body, iterationCount * it.count) 18 case fun@UserFun 19 foreach arg in fc.args do 20 | totalLoad[arg.addrSpace] += iterationCount 21 totalStore[arg.addrSpace] += iterationCount 22 otherwise do // Nothing to count. 23 otherwise do // Nothing to count. 24 return counts
```

Listing 3. Example for memory access count extraction.

Example Consider the program in Listing 3. The algorithm starts with the top-level lambda and soon encounters the mapWrg primitive. At this point in the algorithm, line 14, n will be N/64 (the length of the outer dimension of the input after the split). The algorithm calls recursively countAccessesExpr with N/64 as the iterationCount. When visiting either of the mapLcl in line 3 of Listing 3, n will this time be 64 (the length of the inner dimension of the input after the split).

When the add function is visited, global loads is updated twice, since the add function has two inputs (the tuple is automatically unboxed). Since at this point, the iterationCount is N/64 * 64 = N, the total number of global loads is N + 2, and the total number of local stores is N. When the multByTwo function is visited, local reads and global store are both updated once, resulting in N local loads and N global stores.

4.2.3 Memory Access Patterns

The way a program accesses memory has a profound impact on performance. GPUs coalesce several memory requests into a single one when threads in the same warp/wavefront access a single cache line (typically 128 bytes). It is, therefore, important to extract information about memory access patterns for building an accurate performance predictor.

General Algorithm To determine the total number of cache line reads, our feature extractor recursively traverses the IR, keeping

<table>
<thead>
<tr>
<th>Type</th>
<th>Feature</th>
</tr>
</thead>
<tbody>
<tr>
<td>Parallelism</td>
<td>global size (dimensions 0, 1 and 2) local size (dimensions 0, 1 and 2)</td>
</tr>
<tr>
<td>Memory</td>
<td>amount of local memory allocated global stores per thread global loads per thread local stores per thread local loads per thread average cache lines per access per warp</td>
</tr>
<tr>
<td>Control Flow &amp; Synchronization</td>
<td>barriers per thread if statements per thread for loop bodies executed per thread</td>
</tr>
</tbody>
</table>
track of the iteration count. When a memory access is encountered, it determines the number of unique cache lines accessed by the warp as follows. First, it generates the actual index expression using the existing mechanism of the LIFT compiler [35]. If the expression contains no thread id, it means that all the threads are accessing the same cache line.

When the expression contains a thread id, a new index expression is generated for each thread in the warp by adding a constant to its id (threads in a warp have consecutive ids). Let’s denote the original array index expressed as a function of the thread id as $access(tid)$. Given $n$, the number of threads in a warp, the set of array indices accessed by the warp is:

\[
\{access(tid + 0), access(tid + 1), \ldots, access(tid + n - 1)\}
\]

This list of indices expresses the different addresses accessed by a warp. Given the cache line size $s$ (expressed as a multiple of data size), we compute the list of cache lines accessed:

\[
\{access(tid + 0)/s, access(tid + 1)/s, \ldots, access(tid + n - 1)/s\}
\]

Finally, we can subtract the elements in the list with each other to identify which ones are equal (when the subtraction results in $0$) and count the number of unique accesses.

**Implementation details** The approach explained above is conceptually correct, however, it relies on having the ability to symbolically simplify arithmetic expressions. While the LIFT arithmetic simplifier supports a significant set of simplifications, it is not powerful enough to deal with some simplifications. In such cases, the feature extractor might fail to recognize identical accesses. The following paragraphs explain a few workarounds used inside the feature extractor.

The first issue we encountered, is the difficulty in calculating the set of unique cache lines by subtraction. Conceptually, one could take the first access $access(tid + 0)/s$, subtract every other accesses by it and hope that the algebraic simplifier would be able to return $0$ in case where two accesses are identical. Simplifying expressions as simple as

\[
(tid + 0)/s - (tid + 1)/s
\]

which is $0$ when $s > 1$, is far from trivial given that $/$ represents the integer division.

To overcome this challenge, we modify our approach slightly and add an extra step. Before dividing by $s$, we first calculate all the relative array accesses as an offset of the first access by simple subtraction. The intuition behind this is two-fold. First, it is much easier to simplify a subtraction if it does not contain terms with integer division. Second, we only care about the distances between the accesses rather than their absolute location, therefore, we will still be able to identify the number of unique cache line accessed.

So if the original accesses are

\[
\{tid + 0, tid + 1, \ldots\}
\]

they become

\[
\{(tid + 0) - (tid + 0), (tid + 1) - (tid + 0), \ldots\}
\]

which simplifies trivially to \{0, 1, \ldots\}. Then, we perform the division as before, which leads to \{0/s, 1/s, \ldots\} which trivially simplifies to \{0, 0, \ldots\}. Now it is much easier to identify the unique cache lines.

---

**Example** Consider the example program in Listing 4. The array index being read for the argument of $f$ is $1 + n \times gl_id$, where $i$ is the iteration variable of the mapSeq and $gl_id$ the global thread id. Depending on the split factor $n$, a different number of cache lines will be accessed by a warp. With a split factor of $n = 1$, a single cache line would be accessed since the accesses within a warp are consecutive. However, if the split factor is larger than the warp size, then each warp will be touching a different cache line.

With a cache line of 32 words, 32 threads per warp and 1 word for float, the cache line indices within a warp are:

\[
\{(i + n \times gl_id), (i + n \times (gl_id + 1)), \ldots, (i + n \times (gl_id + 31))\}
\]

Using the trick presented earlier, we can express all indices as an offset from the first one:

\[
\{(i + n \times gl_id)-(i + n \times gl_id),
\quad i + n \times (gl_id + 1)-(i + n \times gl_id),
\quad \cdots,
\quad i + n \times (gl_id + 31)-(i + n \times gl_id)
\}
\]

which simplifies trivially to \{0, n, \ldots, n \times 31\}. Now dividing by the cache line size, we obtain \{0, n/32, \ldots, n \times 31/32\}.

If the split factor $n$ is 1, this results in 32 zeros, meaning all the thread in the warp access a single cache line. When the split factor $n = 4$, this will results in the following list: \{0, 0, 0, 0, 1, 1, 1, 1, \ldots, 7, 7, 7, 7\}. Since it has 8 unique values, the warp touches 8 cache lines for this memory access.

4.3 Control Flow and Synchronization

Another important factor that often limits performance on GPUs is control flow and synchronization. *if-then-else* and for *loop* statements produce branching instructions which is notoriously bad for GPU performance because they typically cause control flow divergence within warps. Similarly, barriers are detrimental to performance since execution is altered until all threads have reached the barrier. For this reason, the feature extractor determines the total number of *if-then-else*, for *loops* and barriers produced by the code generator.

**Algorithm** This is similar to the algorithm used to count the number of memory operations. It traverses the IR recursively, keeping track of the number of times each function is executed. Whenever a pattern that might produce a loop (e.g. iterate, mapLocal, reduceSeq) is encountered, it checks whether a loop will be emitted and update a global loop counters, taking into account the current iteration count.

The algorithm also detects special cases where loops might not be emitted. There are two cases to consider. First, when a mapSeq iterates over an array of size 1, it is clear that a loop is not required. The second case is more subtle and involves mapLocal, mapWrg or mapGlobal. If the size of the input array is smaller than the number of local threads, workgroups or global threads, respectively, the code generator will emit an *if-then-else* statement instead of a loop since the loop can at most be executed once per thread or workgroup.

To determine the number of barriers, the algorithm looks at mapLcl as OpenCL only has barriers inside workgroups. The LIFT
code generator detects unnecessary barriers [35] and tags the call to mapLcl when it is not required. Therefore, we run this barrier elimination pass before feature extraction, and we use this information to ignore the mapLcl which have been marked as not requiring a barrier.

4.4 Use of High-Level Semantic Information

Another practical issue has to do with the pad pattern which is used to implement boundary conditions in stencil programs. Listing 5 shows a simple stencil program applying a clamping boundary condition which simply repeats the outermost value in case of out-of-bounds accesses. Listing 6 shows the generated pseudo-OpenCL code for this program. The pad pattern introduces a lot of ternary operators ?: which check that every memory access is in bound. This operator makes it harder for the simplifier to subtract memory accesses with each other to identify unique cache lines.

To overcome this, we exploit the available high-level semantic information: the padded data is rarely accessed and most accesses are in bound. The feature extractor focuses on the common case by simply ignoring the ternary operator and calculate the index for the common case. Identifying the common case by statically analyzing the OpenCL code is much harder even for this simple example. We would have to predict the common case for two ternary operators which depends on two opaque function calls (global_id and global_size) to the OpenCL library.

4.5 Summary

This section has shown how low-level GPU-specific features are extracted from the Lift IR. Memory-related, control flow, and synchronization features are extracted using information about the length of arrays from the type. We have seen how the fine-grained memory feature related to cache lines accesses is computed using the power of the Lift symbolic arithmetic expressions. The next section explains how we build a simple performance model using these extracted features.

5 Performance Model

Having seen how hardware-specific information is extracted from the high-level IR, we now focus on the performance model. It is based on k-Nearest Neighbors (kNN), which makes prediction based on the distance between programs in the feature space. Intuitively, Lift programs that exhibit similar features are likely to have similar performance.

5.1 Output Variable

The prediction output is throughputs normalized by the maximum achievable per input/program. This is to ensure that performance is comparable across programs, since different programs might exhibit different numbers of operations.

5.2 Principal Component Analysis

Given that a kNN model works best with a small number of features, we use PCA (Principal Component Analysis) to reduce the dimensionality of the feature space. Prior to applying PCA, the features are centered and reduced with a mean of 0 and a standard deviation of 1. This step is necessary since our features have very different ranges of values. PCA is then applied and we retain the principal components that explain 95% of the variance. In effect, this compresses the feature space by removing redundant features.

5.3 K-Nearest Neighbors Model

A k-nearest neighbors model makes a prediction of a new data point by finding the k closest points to it, using Euclidean distance and averaging their responses to make a prediction. In our case, the distance metric is determined by how close the feature vectors are from one another.

The kNN model does not require any special training. The execution time of rewritten Lift expressions, together with their features, are simply collected and added into a database. When predicting a newly unseen Lift program, we simply look up the k closest neighbors and average their prediction to form a new prediction. In our experiment, we used k = 5.

5.4 Making Predictions

To make a prediction about new programs, we first collect data points from a group of training programs. For each program, we conduct an exploration of their optimization space and store the features and corresponding performance. Given a new program, we proceed as follows:

1. For each rewritten program:
   a. The features are extracted, normalized and projected based on the PCA calculated from the training data;
   b. The model predicts the performance using the average of the k-nearest neighbors.

2. The different rewritten programs are sorted based on the prediction.

3. The fastest predicted rewritten program is generated, compiled and executed.

6 Experimental Setup

Platform The setup consists of two GPUs, an NVIDIA Titan Black and an AMD Radeon R9 295X2. The Nvidia platform uses driver version 367.35 and OpenCL 1.2 (CUDA 8.0.0). The AMD platform uses OpenCL 2.0 AMD-APP (1598.5).

Benchmarks and Space We use the 2D stencil benchmarks from [9] listed in Table 2. All experiments are performed using single floating point with matrix sizes from $512^2$ to $8192^2$.

Model evaluation The performance model is evaluated using leave-one-out cross-validation, the standard machine learning methodology. When evaluating performance on a given benchmark, the training data consists of all the data collected from all benchmarks, except the one being tested.
We use the redundancy metric to analyze which features are the most informative about performance: 

\[ R = \frac{I(X, Y)}{H(X) + H(Y)} \]

The redundancy metric normalizes the mutual information by the sum of the entropy of the two variables. This ensures that different features can be compared with one another. In our case, we are interested in comparing each feature with the output we wish to predict: performance. A higher value between a certain feature and the output indicates that the feature is useful for performance prediction.

Figure 4 shows the normalized mutual information between features and performance. As expected, one of the most important features is the average number of cache lines accessed per warp. This feature, which represents locality, is extremely important for stencil benchmarks.

The next most important feature for both machines is the global-Space in dimension 1. This feature is directly related to the number of threads that execute and, therefore, the amount of parallel work performed. It is also used to determine if the kernels are launched using a 2D or 1D iteration space (in the 1D case, the globalSize1 will be 1). Then, comes the number of global stores, followed closely by the number of global loads. This basically corresponds to the number of memory accesses performed into the slow global memory.

For both platforms, barriers and control flow (for loops) seem to have only a medium impact on performance, whereas the number of if-statements does not seem very relevant at all. Focusing on the least important features, the number of local loads does not seem to affect performance much. We conjecture that, since local memory is very fast, having fewer or more local loads might not make much of a difference in terms of performance, especially compared to the number of global memory operations.

### 7.2 Benchmark diversity

Figure 3 shows the features of the best point in the space for our benchmarks. As can be seen, some benchmarks share similarities, which is essential for being able to make prediction. However, we also observe quite a lot of diversity.

### 7.3 Performance Model Correlation

We analyze correlation between the predicted and actual values to measure the model ability at distinguishing between good and bad points. For all programs, the correlation coefficient is in the range \([0.7 - 0.9]\), with average of 0.9 on Nvidia and 0.8 on AMD, which shows the predictor works adequately.

### 7.4 Summary

This section has shown that the most important features for performance prediction on GPUs are related to memory access pattern, amount of parallelism, and number of global memory accesses. The section has also shown that the model’s predictions correlate highly with actual performance. The next section shows how the model is used to speedup the optimization space exploration of our benchmarks.

### 8 Optimization Space Exploration

#### 8.1 Model-based Exploration

This section shows the performance achieved when exploring the space with our predictor. The exploration is conducted by generating 1,000 transformed Lift expression using rewrites and combining them with 10 different thread-counts on average. This leads to 10,000 design points per program/input pair. For each point, we extract the features and use the model to rank them. We then run the design points from highest predicted performance to lowest.

Figure 5 shows the normalized best performance achieved as a function of the number of points evaluated. It also shows the performance achieved using a purely random evaluation order. Using the predictor, it is possible to very quickly achieve 100% of the performance available in the space for all programs. In comparison, the random strategy struggles to reach even 50% of the performance available in some cases after having explored 3% of the whole space.

#### 8.2 Space Exploration Speedups

Figure 6 shows the exploration speedup when using our model compared to random to achieve 90% of the available performance. The speedup is shown in terms of number of samples and total time it takes to run them. A speedup of 10\(x\) means the performance model needs 10\(x\) less runs, or 10\(x\) less time, than random to achieve 90% of the performance.

As can be seen, using the performance model brings large speedup across all programs. When looking at the total number of runs required, on Nvidia, the performance model approach requires 35\(x\) less runs than random. On AMD, there is an even bigger saving: the model requires 77\(x\) less runs than random. When it comes to total time, the model-based approach is a staggering 450\(x\) faster than random for Nvidia and 2000\(x\) faster than random for AMD respectively.

#### 8.3 Detailed Results

This section shows more details per program/input. Figure 7 shows the actual number of runs required to reach 90% of the performance across programs and input. As can be seen, only one run is necessary in the majority of the cases for Nvidia and two runs for AMD. In contrast, random needs over 60 runs for Nvidia and over 180 for AMD in most cases.

The average number of runs using our model is 3 for Nvidia and 5 for AMD. In comparison, random requires on average 97 runs.
for Nvidia and 240 for AMD. These results clearly show that the
performance model is working well in the majority of cases.

Interestingly, there are a couple of outliers programs/input size
combination that require over 30 runs for the model-based approach.
In both cases, stencil2d on Nvidia and srad1 on AMD, this is when
the largest or smallest input sizes are used. We believe that in such
cases, the behavior of these programs probably changes drastically
with the input size. For instance, the data might actually fit entirely
in the cache for the smallest input size of stencil2d and, therefore,
change drastically the behavior of the application for this input size.
Since our features have no notion of working-set size, the model
might be unable to pick up this change of behavior. However, even
in such cases, the model-based exploration is still ahead of random.
For stencil2d, the model needs 31 runs while random needs 691, a
21x speedup!

9 Related Work

Auto-Tuning  OpenTuner [1] is a framework for domain-specific
multi-objective auto-tuners. CLTune [25] is a generic auto-tuner for
OpenCL kernels. ATP [31] is a language-independent auto-tuning
framework which supports inter-parameter constraints. These auto-
tuning approaches attempt to find good implementations using
online search which is orthogonal to our approach. In fact, auto-
tuners can be easily coupled with a performance predictor.

Analytical Performance Modelling  CuMAPz [16] is a compile
time analysis tool that helps programmers increase the memory per-
fomance of CUDA programs. It estimates the effects of performance-
critical memory behaviors, such as data reuse, coalesced accesses,
channel skew, bank conflict and branch divergence. GROPEY [23]
uses the MWP-CWP model [12] (Memory Warp Parallelism – Com-
putation Warp Parallelism) to estimate the GPU performance of
skelton-based applications. GPUPerf [33] is an enhanced version
of the analytical MWP-CWP model with added metrics and a way of
understanding performance bottlenecks. The boat hull model [26]
is a modified version of the roofline model based on an algorithm
classification and produces a roofline model for each class of device.

GPU cache models [27] have been built by extending reuse dis-
tance theory with parallel execution, memory latency, limited associ-
vativity, miss-status holding-registers and warp divergence. COM-
PASS [18] introduces a language for creating analytical performance
models that analyze the amount of floating point and memory op-

Figure 4. Normalized mutual information (redundancy) between
each feature and performance.

Statistical Performance Modelling  Early work [8] extracts static
code features and uses machine learning to predict the performance
of optimization sequences. Principal component analysis, cluster
analysis and regression modelling have been used [15] to generate
predictive models for GPUs and CPUs. Predictive modelling
has also been applied in polyhedral compilation [29] to predict
speedups for different combinations of polyhedral transformations
from hardware performance counters. Graph-based program char-
acterization [28] has also been used for polyhedral compilation
to predict the speedups of optimization sequences. Clustering on
similarity of a graph-based intermediate representation has been
used [7] to cluster similar programs. Another approach [36] uses
machine learning models trained on assembly-level features to
choose a good combination of transformations for vectorization.
High-Level Feature Extraction for Performance Prediction

GPGPU '20, February 23, 2020, San Diego, CA, USA


These approaches try to directly predict the effect tunable parameters have on the performance. However, they rely on the fact that the number of parameters is fixed and known in advance. In contrast, our approach predicts the performance independently of the number of parameters in the program.

Artificial Intelligence for Compilers Genetic programming has been used [17] to generate features for predicting loop unrolling factors. Others [24] have proposed ways of generating program features out of simple ones. Features are encoded as numeric relations and new ones are generated by joining existing relations and aggregating them. TVM [5] uses machine learning to prune the search space for compilation optimizations.

Support Vector Machines have also been used in compilers [32]. Machine learning has also been used to automatically learn compiler heuristics. [37] A neural-network cascade [21] is used to determine the amount of thread coarsening to apply to OpenCL programs for different GPUs.

Machine learning models in compilers traditionally use features extracted from a deeper stage in the compilation pipeline. Our work instead extracts them at a considerably higher-level from a functional IR.

10 Conclusions

This paper has demonstrated that it is possible to extract low-level hardware-specific features from the Lift high-level functional IR. We have shown how type information, such as array length, is useful for computing certain features. The ability to reason symbolically about array indices also enables the extraction of very fined-grained features such as the number of accessed cache lines per warp. To the best of our knowledge, this is the first time a paper has shown how low-level features can be extracted at such high level, without requiring any profiling or performance counters.

The paper has also demonstrated how a simple performance model is built to make accurate performance predictions about different program variants. Using an Nvidia and AMD GPU, and stencil applications, we have shown that our model is able to predict points in the search space that are within 90% of the best within one or two runs in the majority of the cases. When compared to a random search strategy, the model requires on average 77x less runs than random on AMD and 35x less on Nvidia, which translates to time savings of 2000x and 450x respectively.
References


[27] Cedric Nugteren, Gert-Jan van den Braak, Henk Corporaal, and Henri E. Bal. 2014. A detailed GPU cache model based on reuse distance theory. In HPCA. IEEE. https://doi.org/10.1109/HPCA.2014.6835955


