# Gemm Tuner

## Introduction

This is a set of tools for tuning the performance of OpenCL GEMM kernels. Specifically, we tune 3 GEMM kernels, each
of which has a different implementation **strategy** of the GEMM operation: **native**, **reshaped**, **reshaped only rhs**.
The details of these strategies can be found in the documentation of the corresponding kernels:
**CLGEMMMatrixMultiplyNativeKernel**, **CLGEMMMatrixMultiplyReshapedKernel** and
**CLGEMMMatrixMultiplyReshapedOnlyRHSKernel**.

The Tuner consists of 2 scripts and 3 binaries:
* benchmark_gemm_examples.sh and GemmTuner.py under examples/gemm_tuner, and
* benchmark_cl_gemm_native, benchmark_cl_gemm_reshaped_rhs_only and benchmark_cl_gemm_reshaped under
  build/tests/gemm_tuner (you'll need to build the library first)

The input to the Tuner is a list of 4-valued tuples (M, N, K, B), possibly with a data type, that we call
**GEMM shape** or **GEMMParam**. They define the "shape" and other parameters (e.g. data type) of a GEMM operation:
```
LHS x RHS = DST
```
Where LHS is of shape MxK, RHS is of shape KxN and DST is of shape MxN, and B is the batch size.

The outputs of the tuning process are 4 json files:
1. gemm_type_selection.json: selects which kernel type is the best for each GEMMParam
2. gemm_config_native.json: selects a list of best **GEMMConfigs** of the native kernel for each GEMMParam
3. gemm_config_reshapedonlyrhs.json: selects a list of best GEMMConfigs of the reshaped_only_rhs kernel for each GEMMParam
4. gemm_config_reshaped.json: selects a list of best GEMMConfigs of the reshaped kernel for each GEMMParam

These 4 files are the current representations we use for what we call the **heuristics** of a GEMM op: given a GEMMParam,
which kernel, and subsequently which configurations for that kernel, are the most performant.

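
To make the GEMMParam notation used in this introduction concrete, here is a minimal Python sketch of how M, N, K and B map onto the operand shapes. The `GEMMParam` tuple and both helper functions are hypothetical names made up for illustration; they are not taken from GemmTuner.py.

```python
from typing import NamedTuple


class GEMMParam(NamedTuple):
    """Hypothetical container for one GEMM shape (not the Tuner's own class)."""
    M: int  # rows of LHS and DST
    N: int  # columns of RHS and DST
    K: int  # columns of LHS / rows of RHS
    B: int  # batch size


def operand_shapes(p):
    """Return the (rows, cols) shape of LHS, RHS and DST for a single batch."""
    return {"LHS": (p.M, p.K), "RHS": (p.K, p.N), "DST": (p.M, p.N)}


def mac_count(p):
    """Multiply-accumulate operations over all batches: M * N * K * B."""
    return p.M * p.N * p.K * p.B


param = GEMMParam(M=100, N=100, K=30, B=3)
print(operand_shapes(param))
print(mac_count(param))
```

The multiply-accumulate count is only a rough proxy for the amount of work per shape; the Tuner itself relies on measured kernel times.
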
## Step-by-step example

### Step1: Prepare the shape and configs files
1. We first need to identify the shapes that we are interested in and store them in a csv file, say *gemm_shapes.csv*.
2. Then we need to specify a set of good GEMMConfig candidates for each kernel in 3 separate csv files (this requires
    some prior heuristics, but can be provided by the Compute Library developers upon request, based on your target device).

    Say we have *gemm_configs_native.csv*, *gemm_configs_reshaped.csv* and *gemm_configs_reshaped_only_rhs.csv*.

    Please refer to the Prerequisite section for more details.

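
As a rough illustration of this step, the sketch below writes the four headerless csv files with Python's standard `csv` module. The rows are simply the example values from the Prerequisite section, not recommended candidates, and `write_headerless_csv` is a hypothetical helper. The files go to a scratch directory here; in practice you would write them wherever you stage files for the device.

```python
import csv
import os
import tempfile


def write_headerless_csv(path, rows):
    """Write rows as comma-separated lines with no header and Unix line endings."""
    with open(path, "w", newline="") as f:
        csv.writer(f, lineterminator="\n").writerows(rows)


# Illustrative values only, copied from the examples in the Prerequisite section.
files = {
    "gemm_shapes.csv": [(100, 100, 30, 1), (100, 100, 30, 3)],   # M, N, K, B
    "gemm_configs_native.csv": [(1, 4, 4), (2, 3, 8)],           # m0, n0, k0
    "gemm_configs_reshaped_only_rhs.csv": [(4, 4, 4, 1, 1, 1, 0), (4, 4, 4, 3, 1, 0, 1)],
    "gemm_configs_reshaped.csv": [(4, 4, 4, 1, 3, 1, 1, 1, 0), (4, 4, 4, 3, 3, 1, 1, 0, 1)],
}

out_dir = tempfile.mkdtemp()  # scratch directory for this sketch
for name, rows in files.items():
    write_headerless_csv(os.path.join(out_dir, name), rows)

print(sorted(os.listdir(out_dir)))
```

Unix line endings are chosen deliberately, since the files are later read by a Bash script on the target device.
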
### Step2: Push relevant files to the target device
All the files that need to be present on the target device are:
* benchmark script: \<ComputeLibrary\>/examples/gemm_tuner/benchmark_gemm_examples.sh
* shapes and configs csv files: gemm_shapes.csv, gemm_configs_native.csv, gemm_configs_reshaped_only_rhs.csv, gemm_configs_reshaped.csv
* Example benchmark binaries: \<ComputeLibrary\>/build/tests/gemm_tuner/benchmark_cl_gemm*

### Step3: Collect benchmark data
With these files on device, we can collect benchmark data using the script. Assume all the example binaries are pushed
to a folder called *gemm_tuner*. While logged onto our device:
```
# Native
./benchmark_gemm_examples.sh -s native -e ./gemm_tuner -g ./gemm_shapes.csv -c ./gemm_configs_native.csv -o results/native
# Reshaped Only RHS
./benchmark_gemm_examples.sh -s reshaped_rhs_only -e ./gemm_tuner -g ./gemm_shapes.csv -c ./gemm_configs_reshaped_only_rhs.csv -o results/reshaped_only_rhs
# Reshaped
./benchmark_gemm_examples.sh -s reshaped -e ./gemm_tuner -g ./gemm_shapes.csv -c ./gemm_configs_reshaped.csv -o results/reshaped
```
You can repeat the 3 commands above to add some redundancy to your benchmark data (measurements are noisy),
but you may need to change the output folder for each repeat.

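
If you do repeat the runs, one way to keep the output folders apart is to generate the command lines programmatically. The Python sketch below only builds and prints the commands; it does not execute anything. It uses the flags shown above, while the `results_<n>/...` folder naming is a hypothetical scheme chosen for this example.

```python
import shlex

# (strategy passed to -s, config csv, output sub-folder), as in the commands above
RUNS = [
    ("native", "./gemm_configs_native.csv", "native"),
    ("reshaped_rhs_only", "./gemm_configs_reshaped_only_rhs.csv", "reshaped_only_rhs"),
    ("reshaped", "./gemm_configs_reshaped.csv", "reshaped"),
]


def benchmark_commands(repeats, out_root="results"):
    """Build one command line per strategy and repeat, each with its own -o folder."""
    commands = []
    for rep in range(repeats):
        for strategy, config_csv, subdir in RUNS:
            argv = [
                "./benchmark_gemm_examples.sh",
                "-s", strategy,
                "-e", "./gemm_tuner",
                "-g", "./gemm_shapes.csv",
                "-c", config_csv,
                "-o", f"{out_root}_{rep}/{subdir}",  # hypothetical naming scheme
            ]
            commands.append(" ".join(shlex.quote(arg) for arg in argv))
    return commands


for line in benchmark_commands(repeats=2):
    print(line)
```

The printed lines can be pasted into the device shell or saved as a script. How the repeated result folders should then be handed to GemmTuner.py is not specified here, so treat the folder layout as an assumption to check against your setup.
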
### Step4: Generate the heuristics
1. After benchmarking, we pull the benchmark data, the *results* folder, from the target device to our host machine
2. We use the GemmTuner.py script to give us the heuristics
    ```
    python3 <ComputeLibrary>/examples/gemm_tuner/GemmTuner.py -b ./results -o heuristics
    ```
    When it's finished, there should be 4 json files in the *heuristics* folder

One thing to notice is that the config heuristics might give more than one recommendation for each GEMMParam, because
we accept all good GEMMConfigs within a tolerance. If you want fewer recommendations, you can decrease the tolerance by
passing a lower value via *-t \<tolerance\>* to the GemmTuner.py script.

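
The effect of the tolerance can be pictured with a small Python sketch. This is an illustration of the idea only, under the assumption that a config counts as "good" when its time is within the tolerance of the fastest one; it is not GemmTuner.py's actual implementation, and the timings are made up.

```python
def configs_within_tolerance(timings_ms, tolerance_ms):
    """Return every config whose time is within tolerance_ms of the fastest one."""
    best = min(timings_ms.values())
    return sorted(cfg for cfg, t in timings_ms.items() if t - best <= tolerance_ms)


# Made-up timings in milliseconds for native configs <m0, n0, k0>; not real measurements.
timings = {(1, 4, 4): 1.00, (2, 3, 8): 1.04, (4, 4, 4): 1.50}

print(configs_within_tolerance(timings, 0.1))   # looser tolerance: more recommendations
print(configs_within_tolerance(timings, 0.01))  # tighter tolerance: fewer recommendations
```

With the looser tolerance the two fastest configs are treated as equivalent; with the tighter one only the fastest remains.
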
## Prerequisite
* A target device to be tuned, plus the following on the device:
    * Android or Linux OS
    * Bash shell
    * Built Compute Library with benchmark examples binaries
    * benchmark_gemm_examples.sh script
    * gemm shape file

        A csv file containing the **GEMMParam search list**. This is the list of GEMMParams/gemm shapes that we're
        interested in (for more details see the Approach section). The default list is prepared by Compute Library developers in advance
        and can be provided on request.

        The format is described as:

        A headerless csv file with fields separated by commas.

        A gemm shape is a list of 4 positive integers \<M, N, K, B\> describing the shapes of the two matrices (LHS and
        RHS) with:

        * M - Number of lhs matrix rows
        * N - Number of rhs matrix columns
        * K - Number of lhs matrix columns/rhs matrix rows
        * B - Batch size

        An example gemm shape file looks like:
        ```
        100,100,30,1
        100,100,30,3
        ...
        ```
    * gemm config file

        A csv file containing the **GEMMConfig search list**. This is the list of candidate GEMMConfigs among which we
        search for the optimal one. **Note that we have a different list for each strategy.**
        The default lists are prepared by Compute Library developers in advance and can be provided on request.

        The format of the file for each strategy is the same:

        A headerless csv file with fields separated by commas.

        However, the fields of GEMMConfig differ for each strategy:

        * Strategy **native**:

            A gemm config is a list of 3 positive integers \<m0, n0, k0\>, with:

            * m0 - Number of rows processed by the matrix multiplication
            * n0 - Number of columns processed by the matrix multiplication
            * k0 - Number of partial accumulations performed by the matrix multiplication

            Only the following configurations of M0, N0 and K0 are currently supported:

            * M0 = 1, 2, 3, 4, 5, 6, 7, 8
            * N0 = 2, 3, 4, 8, 16
            * K0 = 2, 3, 4, 8, 16

            An example gemm config file looks like:
            ```
            1,4,4
            2,3,8
            ...
            ```
        * Strategy **reshaped_rhs_only**:

            A gemm config is a list of 4 positive integers \<m0, n0, k0, h0\> and 3 boolean values:

            * m0 - Number of rows processed by the matrix multiplication
            * n0 - Number of columns processed by the matrix multiplication
            * k0 - Number of partial accumulations performed by the matrix multiplication
            * h0 - Number of horizontal blocks of size (k0xn0) stored on the same output row
            * interleave_rhs - Interleave rhs matrix (1) / Do not interleave rhs matrix (0)
            * transpose_rhs - Transpose rhs matrix (1) / Do not transpose rhs matrix (0)
            * export_to_cl_image_rhs - Export rhs matrix to cl_image (1) / Do not export rhs matrix to cl_image (0). Can only be true
              with certain combinations of the GEMMParams and other configs. Please refer to CLGEMMReshapeRHSMatrixKernel
              for more details

            Only the following configurations of M0, N0, K0 and H0 are currently supported:

            * M0 = 1, 2, 3, 4, 5, 6, 7, 8
            * N0 = 2, 3, 4, 8, 16
            * K0 = 2, 3, 4, 8, 16
            * H0 >= 1

            An example gemm config file looks like:
            ```
            4,4,4,1,1,1,0
            4,4,4,3,1,0,1
            ...
            ```
        * Strategy **reshaped**:

            A gemm config is a list of 5 positive integers \<m0, n0, k0, v0, h0\> and 4 boolean values:

            * m0 - Number of rows processed by the matrix multiplication
            * n0 - Number of columns processed by the matrix multiplication
            * k0 - Number of partial accumulations performed by the matrix multiplication
            * v0 - Number of vertical blocks of size (m0xk0) stored on the same output row
            * h0 - Number of horizontal blocks of size (k0xn0) stored on the same output row
            * interleave_lhs - Interleave lhs matrix (1) / Do not interleave lhs matrix (0)
            * interleave_rhs - Interleave rhs matrix (1) / Do not interleave rhs matrix (0)
            * transpose_rhs - Transpose rhs matrix but not lhs matrix (1) / Do not transpose rhs matrix but do transpose lhs matrix (0)
            * export_to_cl_image_rhs - Export rhs matrix to cl_image (1) / Do not export rhs matrix to cl_image (0). Can only be true
              with certain combinations of the GEMMParams and other configs. Please refer to CLGEMMReshapeRHSMatrixKernel
              for more details

            If the rhs matrix is transposed, only the following configurations are currently supported:

            * M0 = 2, 3, 4, 5, 6, 7, 8
            * N0 = 2, 3, 4, 8, 16
            * K0 = 2, 3, 4, 8, 16
            * V0 >= 1
            * H0 >= 1

            If the lhs matrix is transposed, only the following configurations are currently supported:

            * M0 = 2, 3, 4, 8
            * N0 = 2, 3, 4, 8, 16
            * K0 = 2, 3, 4, 8, 16
            * V0 >= 1
            * H0 >= 1

            An example gemm config file looks like:
            ```
            4,4,4,1,3,1,1,1,0
            4,4,4,3,3,1,1,0,1
            ...
            ```
* A host machine, plus these on the machine:
    * python >= 3.6
    * GemmTuner.py script

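
Since both input files are headerless csv with strict field rules, it can help to check them on the host before pushing them to the device. The following Python sketch validates gemm shape rows and **native** config rows against the constraints listed in this section. The function names are hypothetical, and the benchmark script may apply its own, different checks.

```python
import csv
import io

# Supported values for the native strategy, as listed in this section.
M0_VALUES = {1, 2, 3, 4, 5, 6, 7, 8}
N0_K0_VALUES = {2, 3, 4, 8, 16}


def parse_shape(row):
    """Parse one <M, N, K, B> row; all four fields must be positive integers."""
    if len(row) != 4:
        raise ValueError(f"expected 4 fields, got {len(row)}: {row}")
    shape = tuple(int(field) for field in row)
    if any(value <= 0 for value in shape):
        raise ValueError(f"gemm shape fields must be positive: {row}")
    return shape


def parse_native_config(row):
    """Parse one <m0, n0, k0> row and check it against the supported values."""
    if len(row) != 3:
        raise ValueError(f"expected 3 fields, got {len(row)}: {row}")
    m0, n0, k0 = (int(field) for field in row)
    if m0 not in M0_VALUES or n0 not in N0_K0_VALUES or k0 not in N0_K0_VALUES:
        raise ValueError(f"unsupported native config: {row}")
    return m0, n0, k0


def read_rows(text, parse):
    """Apply parse to every non-empty row of headerless csv text."""
    return [parse(row) for row in csv.reader(io.StringIO(text)) if row]


print(read_rows("100,100,30,1\n100,100,30,3\n", parse_shape))
print(read_rows("1,4,4\n2,3,8\n", parse_native_config))
```

The same pattern extends to the reshaped strategies by adding the extra integer and boolean fields and their supported values.
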
## Usage
The usage of the 2 scripts:

1. benchmark_gemm_examples.sh

    Run the shell script (**benchmark_gemm_examples.sh**) on your **target device**. Note that all the built benchmark
    examples: build/tests/gemm_tuner/benchmark_cl_gemm*, have to be present on your target device prior to running.
    The benchmark results will be saved to json files in an output directory.
    ```
    Usage: benchmark_gemm_examples.sh [-h] -s <strategy> -e <example_binary_dir> -g <gemm_shape_file>
           -c <gemm_config_file> [-d <data_type>] [-o <out_dir>]

    Options:
        -h
            Print help messages. If a strategy is specified with -s <strategy>, then only display messages relevant to that
            strategy. Otherwise if no strategy is specified, display messages for all available strategies.

        -s <strategy>
            Strategy option.
            Options: ${ALL_STRATEGY_OPTIONS[@]}.

        -e <example_binary_dir>
            Path to directory that holds all example binaries

        -g <gemm_shape_file>
            Path to gemm shape csv file

        -c <gemm_config_file>
            Path to gemm config csv file

        -d <data_type>
            Data type option with which to run benchmark examples
            Default: ${DEFAULT_DATA_TYPE}
            Supported options:
            Strategy          : Data Types
            Native            : F32
            Reshaped          : F16, F32
            Reshaped RHS Only : F16, F32

        -o <out_dir>
            Path to output directory that holds output json files
            Default: ${DEFAULT_OUT_DIR}
    ```
2. GemmTuner.py:

    Run the python script (**GemmTuner.py**) on your **host machine**.
    You'll need to transfer all the benchmark result json files generated from the previous step to your host machine
    beforehand. The script will output the best kernel and gemm configurations for each gemm param in the 4 output json files.
    ```
    Usage: GemmTuner.py [-h] -b PATH [-o PATH] [-t TOLERANCE] [-D]

    CL GEMM Tuner

    optional arguments:
      -h, --help            show this help message and exit
      -b PATH, --benchmark_results PATH
                            Path to benchmark result directory, where benchmark
                            result json files have a file extension of
                            'gemmtuner_benchmark'
      -o PATH, --output_dir PATH
                            Path to directory that holds output json files.
      -t TOLERANCE, --tolerance TOLERANCE
                            For testing if two GEMMConfigs are equivalent in terms
                            of performance. The tolerance is OpenCL timer in
                            milliseconds. Recommended value: <= 0.1 ms
      -D, --debug           Enable script debugging output
    ```

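
Before running GemmTuner.py, it can be useful to confirm that the result files actually arrived on the host. The Python sketch below lists files with the `gemmtuner_benchmark` extension mentioned in the help text and checks that each one parses as json. Searching sub-folders recursively is an assumption of this sketch, the function names are hypothetical, and nothing is assumed about the layout of the json contents.

```python
import json
import tempfile
from pathlib import Path


def find_benchmark_results(results_dir):
    """Return all files under results_dir with the 'gemmtuner_benchmark' extension."""
    return sorted(Path(results_dir).rglob("*.gemmtuner_benchmark"))


def load_results(results_dir):
    """Load each result file as json; the layout of the contents is not assumed here."""
    loaded = {}
    for path in find_benchmark_results(results_dir):
        with open(path) as f:
            loaded[path.name] = json.load(f)
    return loaded


# Demo on a scratch directory with a placeholder file (file name and contents are made up).
root = Path(tempfile.mkdtemp())
(root / "native").mkdir()
(root / "native" / "example.gemmtuner_benchmark").write_text(json.dumps({"placeholder": True}))
(root / "native" / "notes.txt").write_text("ignored")

print(list(load_results(root)))
```

An empty list here would suggest that the *results* folder was not pulled correctly or that the path passed to `-b` is wrong.
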