Accelerating the MLPerf™ Tiny Image Classification Model on CFU Playground #5: Pure Software Acceleration

Yeecy
30 min read · Jan 8, 2024


Before we start adding hardware acceleration, let's see how much speedup pure software improvements can achieve. After all, optimizing in software is always simpler and less time-consuming than writing a hardware accelerator, and it will also let us appreciate later on how valuable it is to have the remaining computations accelerated in hardware.

Amdahl's law tells us that the achievable speedup depends on how large a fraction of the model's total computation the improved part accounts for, and on how much we manage to accelerate that part: if the improved part takes a fraction p of the runtime and is sped up by a factor s, the overall speedup is 1 / ((1 - p) + p / s). ConvPerChannel, the function that actually performs the convolutions, is therefore the top priority for optimization. Besides it, the Add and FullyConnected functions reuse some of the logic serving ConvPerChannel, so they will be improved along the way. Concretely, the files to be modified are the following.

  • src/tensorflow/lite/kernels/internal/reference/integer_ops/add.h
  • src/tensorflow/lite/kernels/internal/reference/integer_ops/conv.h
  • src/tensorflow/lite/kernels/internal/reference/integer_ops/fully_connected.h

The times given below are the average inference time over 200 test images, measured on a VexRiscv_FullCfu CPU running at 75 MHz; readers can convert them to cycle counts with simple arithmetic.

Clean version: 4044074.545 µs
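
For example, at 75 MHz each microsecond is 75 cycles, so the clean version corresponds to 4044074.545 × 75 ≈ 303,305,591 cycles.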

5.1 Parameter Specialization

Since this series focuses exclusively on accelerating the MLPerf™ Tiny image classification model, we can specialize the code for the parameter values this model actually uses, trimming away potentially redundant computation.

Below is the FullyConnected code snippet from TFLM. If you print out this layer's parameters, you will find that batches is 1; substituting 1 lets the compiler remove the batch loop on that basis. Still, I recommend doing by hand whatever you can do by hand, which guarantees the loop really disappears (see the sketch after the snippet). Following this approach, we can substitute all the relevant values, making it easier for the compiler, or for ourselves, to optimize.

/* Copyright 2019 The TensorFlow Authors. All Rights Reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
<http://www.apache.org/licenses/LICENSE-2.0>
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
=======================================================================*/

// Code Snippet of the FullyConnected Layer

inline void FullyConnected(
    const FullyConnectedParams& params, const RuntimeShape& input_shape,
    const int8_t* input_data, const RuntimeShape& filter_shape,
    const int8_t* filter_data, const RuntimeShape& bias_shape,
    const int32_t* bias_data, const RuntimeShape& output_shape,
    int8_t* output_data) {
  const int32_t input_offset = params.input_offset;
  const int32_t filter_offset = params.weights_offset;
  const int32_t output_offset = params.output_offset;
  const int32_t output_multiplier = params.output_multiplier;
  const int output_shift = params.output_shift;
  const int32_t output_activation_min = params.quantized_activation_min;
  const int32_t output_activation_max = params.quantized_activation_max;
  TFLITE_DCHECK_GE(filter_shape.DimensionsCount(), 2);
  TFLITE_DCHECK_GE(output_shape.DimensionsCount(), 1);
  TFLITE_DCHECK_LE(output_activation_min, output_activation_max);
  const int filter_dim_count = filter_shape.DimensionsCount();
  const int output_dim_count = output_shape.DimensionsCount();
  const int batches = FlatSizeSkipDim(output_shape, output_dim_count - 1);
  const int output_depth = output_shape.Dims(output_dim_count - 1);
  TFLITE_DCHECK_LE(output_depth, filter_shape.Dims(filter_dim_count - 2));
  const int accum_depth = filter_shape.Dims(filter_dim_count - 1);
  for (int b = 0; b < batches; ++b) {
    for (int out_c = 0; out_c < output_depth; ++out_c) {
      int32_t acc = 0;
      for (int d = 0; d < accum_depth; ++d) {
        int32_t input_val = input_data[b * accum_depth + d];
        int32_t filter_val = filter_data[out_c * accum_depth + d];
        acc += (filter_val + filter_offset) * (input_val + input_offset);
      }
      if (bias_data) {
        acc += bias_data[out_c];
      }
      acc = MultiplyByQuantizedMultiplier(acc, output_multiplier, output_shift);
      acc += output_offset;
      acc = std::max(acc, output_activation_min);
      acc = std::min(acc, output_activation_max);
      output_data[out_c + output_depth * b] = static_cast<int8_t>(acc);
    }
  }
}
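
For instance, the substitution idea for batches boils down to the following sketch (the value 1 comes from printing this layer's parameters):

// Before: the batch count is computed from the output shape at runtime.
// const int batches = FlatSizeSkipDim(output_shape, output_dim_count - 1);

// After: substitute the printed value; the loop over b is now provably
// a single iteration and can be deleted by hand.
const int batches = 1;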

You can also see that the function contains many validity checks on its parameters; since the model we use is certainly well-formed, we can drop them to save a few more cycles.

The code after my optimization pass following this method is shown below; readers can try rewriting ConvPerChannel and Add the same way.

/* Copyright 2019 The TensorFlow Authors. All Rights Reserved.
Copyright 2023-2024 Chung-Yi Chen (Yeecy). All Rights Reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
<http://www.apache.org/licenses/LICENSE-2.0>
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
=======================================================================*/

// Code Snippet of the Specialized FullyConnected Layer

inline void FullyConnected(
    const FullyConnectedParams& params, const RuntimeShape& input_shape,
    const int8_t* input_data, const RuntimeShape& filter_shape,
    const int8_t* filter_data, const RuntimeShape& bias_shape,
    const int32_t* bias_data, const RuntimeShape& output_shape,
    int8_t* output_data) {
  // All constants below are this model's printed parameter values:
  // batches = 1 (loop removed), output_depth = 10, accum_depth = 64,
  // input_offset = 128, weights_offset = 0 (hence omitted),
  // output_multiplier = 1552512742, output_shift = -5, output_offset = 24.
  for (int out_c = 0; out_c < 10; out_c++) {
    int32_t acc = bias_data[out_c];
    for (int d = 0; d < 64; d++) {
      acc += (input_data[d] + 128) * filter_data[out_c*64 + d];
    }
    acc = MultiplyByQuantizedMultiplier(acc, 1552512742, -5);
    acc += 24;
    acc = std::max(acc, (int32_t)-128);
    acc = std::min(acc, (int32_t)127);
    output_data[out_c] = static_cast<int8_t>(acc);
  }
}

Specializing ConvPerChannel, Add, and FullyConnected yields a substantial speedup.

Parameter specialization: 2366758.44 µs, about a 1.708× speedup

5.2 Branch Hints and Range-Check Improvements

Below is the original range-check code, which contains one branch. Normally we would expect a processor's branch prediction unit to handle this well, but the CPU used here is certainly no match for the processors we use every day, so it is worth trying some branch hints to see whether they improve things. In my tests, marking the branch with the [[unlikely]] attribute does pay off; you could call this optimization based on manual profiling (just a joke, don't take it seriously).

/* Copyright 2019 The TensorFlow Authors. All Rights Reserved.
Copyright 2023-2024 Chung-Yi Chen (Yeecy). All Rights Reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
<http://www.apache.org/licenses/LICENSE-2.0>
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
=======================================================================*/

// Code Snippet of the Range Check Logic

const bool is_point_inside_image =
    (in_x >= 0) && (in_x < input_width) && (in_y >= 0) &&
    (in_y < input_height);
if (!is_point_inside_image) [[unlikely]] {
  continue;
}

Branch hints: 2333546.375 µs, about a 1.014× speedup

During the earlier specialization I split ConvPerChannel into three specialized variants by kernel size and stride: the first with stride 1 and a 3×3 kernel, the second with stride 2 and a 3×3 kernel, and the third with stride 2 and a 1×1 kernel.

For the latter two variants, careful readers will notice that in theory no out-of-bounds access can occur, so the range check should be unnecessary. However, presumably for implementation simplicity, the actual output size is larger than the size computed from the definition, so the range-check logic cannot simply be deleted. The good news is that these two variants can never produce coordinates below 0, so we can skip the in_x >= 0 and in_y >= 0 computations, reducing four comparisons to two. The first variant uses a padding of 1, so coordinates below 0 can occur, and the four comparisons seem unavoidable.

Or are they? Let's look at (in_x >= 0) && (in_x < input_width) && (in_y >= 0) && (in_y < input_height). First, input_width is guaranteed to lie between 0 and 2147483647, which means its most significant bit is always 0, while a negative in_x has its most significant bit set to 1. Now the interesting part: if either in_x >= 0 or in_x < input_width fails, then (uint32_t)in_x < (uint32_t)input_width fails (if in_x is negative, the cast turns it into a value greater than 2147483647, so the comparison against input_width fails); and if both in_x >= 0 and in_x < input_width hold, then (uint32_t)in_x < (uint32_t)input_width holds. In other words, (in_x >= 0) && (in_x < input_width) is equivalent to (uint32_t)in_x < (uint32_t)input_width. Applying this observation, we can turn the four comparisons into two here as well. Moreover, on RISC-V such a cast-and-compare is translated into a single SLTU instruction, so there is no extra cost from the casts.
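
In code, the transformation looks like this (using the same variables as the range-check snippet above):

// Four comparisons, as in the original range check.
const bool inside4 =
    (in_x >= 0) && (in_x < input_width) &&
    (in_y >= 0) && (in_y < input_height);

// Two comparisons: a negative in_x wraps to a value above 2147483647
// after the cast, so each unsigned compare covers both bounds and
// lowers to a single SLTU instruction on RISC-V.
const bool inside2 =
    (uint32_t)in_x < (uint32_t)input_width &&
    (uint32_t)in_y < (uint32_t)input_height;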

In summary, for all three variants we can cut the range check from four comparisons down to two, which brings a speedup.

Range-check improvement: 2297518.38 µs, about a 1.015× speedup

5.3 Im2col and Tiled Matrix Multiplication

Now let's convert the convolution layers into matrix multiplications using the method from Chapter 2. Here I transpose the kernel matrix, that is, I swap i and j in the kernel-matrix construction code given in Chapter 2, which saves the time otherwise needed to initialize the result matrix before the multiplication. Below is the computation code for the first variant (stride 1, 3×3 kernel); looking at how the matrix multiplication is written should make clear what skipping the result-matrix initialization means.

// Copyright (c) 2023-2024 Chung-Yi Chen (Yeecy)
// Im2col and GEMM

const int M = H_output * W_output;
const int K = H_kernel * W_kernel * C_input;
const int N = C_output;

int8_t im2col[150000];
int8_t kernel[150000];
int32_t result[150000];

// Construct im2col matrix
for (int out_y = 0; out_y < H_output; out_y++) {
  for (int out_x = 0; out_x < W_output; out_x++) {
    for (int filter_y = 0; filter_y < 3; filter_y++) {
      for (int filter_x = 0; filter_x < 3; filter_x++) {
        for (int in_channel = 0; in_channel < C_input; in_channel++) {
          const int j = in_channel*9 + filter_y*3 + filter_x;
          const int i = out_y*W_output + out_x;
          const int in_y = out_y + filter_y - 1;
          const int in_x = out_x + filter_x - 1;

          const bool is_point_inside_image =
              (uint32_t)in_y < (uint32_t)(H_input) &&
              (uint32_t)in_x < (uint32_t)(W_input);

          im2col[i*K + j] = is_point_inside_image
              ? input_data[Offset(input_shape, 0, in_y, in_x, in_channel)]
              : -128;
        }
      }
    }
  }
}

// Construct kernel matrix
for (int out_channel = 0; out_channel < C_output; out_channel++) {
  for (int filter_y = 0; filter_y < 3; filter_y++) {
    for (int filter_x = 0; filter_x < 3; filter_x++) {
      for (int in_channel = 0; in_channel < C_input; in_channel++) {
        const int j = in_channel*9 + filter_y*3 + filter_x;
        const int i = out_channel;
        kernel[i*K + j] = filter_data[Offset(filter_shape, out_channel,
                                             filter_y, filter_x, in_channel)];
      }
    }
  }
}

// Naive GEMM
for (int m = 0; m < M; m++) {
  for (int n = 0; n < N; n++) {
    int32_t acc = 0;

    for (int k = 0; k < K; k++) {
      acc += (im2col[m*K + k] + 128) * kernel[n*K + k];
    }

    result[m*N + n] = acc;
  }
}

// Write back to output_data
for (int out_channel = 0; out_channel < C_output; out_channel++) {
  for (int out_y = 0; out_y < H_output; out_y++) {
    for (int out_x = 0; out_x < W_output; out_x++) {
      const int i = out_y*W_output + out_x;
      const int j = out_channel;
      int32_t acc = result[i*N + j];
      // post-processing logic is omitted here
    }
  }
}

Im2col and GEMM: 2273045.74 µs, about a 1.010× speedup

Next, we tile the matrix multiplication for better cache behavior; I found that a tile size of 8 performs best.

// Copyright (c) 2023-2024 Chung-Yi Chen (Yeecy)
// Tiled GEMM

for (int m = 0; m < pad_M; m += 8) {
  const int MM = std::min(m + 8, M) - m;

  for (int n = 0; n < pad_N; n += 8) {
    const int NN = std::min(n + 8, N) - n;

    for (int mm = 0; mm < MM; mm++) {
      for (int nn = 0; nn < NN; nn++) {
        int32_t acc = 0;

        for (int k = 0; k < K; k++) {
          acc += (im2col[(m + mm)*K + k] + 128) * kernel[(n + nn)*K + k];
        }

        result[(m + mm)*N + n + nn] = acc;
      }
    }
  }
}

Im2col and tiled GEMM: 2209699.635 µs, about a 1.028× speedup

Finally, let's try unrolling the innermost loop. If you print K for every layer, you will find that, apart from the first layer where K is 27, the greatest common divisor of all the remaining layers' K values is 16, so 16 is a sensible unroll factor; my tests confirm that unrolling 16 times indeed gives the best performance.
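
A minimal sketch of what the unrolled inner K loop could look like (the exact unrolled code is not shown in this post; variable names follow the tiled GEMM above):

int32_t acc = 0;
int k = 0;

// Main body: 16 multiply-accumulates per group; the constant-bound
// inner loop is trivial for the compiler to unroll completely.
for (; k + 16 <= K; k += 16) {
  for (int u = 0; u < 16; u++) {
    acc += (im2col[(m + mm)*K + k + u] + 128) * kernel[(n + nn)*K + k + u];
  }
}

// Remainder for K values that are not multiples of 16 (e.g. K = 27
// in the first layer).
for (; k < K; k++) {
  acc += (im2col[(m + mm)*K + k] + 128) * kernel[(n + nn)*K + k];
}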

Loop unrolling: 1446542.495 µs, about a 1.527× speedup

5.4 Improving the gemmlowp Functions

At this point essentially everything that can be changed has been changed. About the only remaining candidates are MultiplyByQuantizedMultiplier, used by ConvPerChannel and FullyConnected, and MultiplyByQuantizedMultiplierSmallerThanOneExp, used by Add. Looking at the implementations below, both rely on two functions from the gemmlowp library, RoundingDivideByPOT and SaturatingRoundingDoublingHighMul; for brevity, I will abbreviate them as RDBPOT and SRDHM.

/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
<http://www.apache.org/licenses/LICENSE-2.0>
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
=======================================================================*/

inline int32 MultiplyByQuantizedMultiplier(int32 x,
                                           int32 quantized_multiplier,
                                           int shift) {
  using gemmlowp::RoundingDivideByPOT;
  using gemmlowp::SaturatingRoundingDoublingHighMul;
  int left_shift = shift > 0 ? shift : 0;
  int right_shift = shift > 0 ? 0 : -shift;
  return RoundingDivideByPOT(SaturatingRoundingDoublingHighMul(
                                 x * (1 << left_shift), quantized_multiplier),
                             right_shift);
}

inline int32 MultiplyByQuantizedMultiplierSmallerThanOneExp(
    int32 x, int32 quantized_multiplier, int left_shift) {
  using gemmlowp::RoundingDivideByPOT;
  using gemmlowp::SaturatingRoundingDoublingHighMul;
  return RoundingDivideByPOT(
      SaturatingRoundingDoublingHighMul(x, quantized_multiplier),
      -left_shift);
}

The next step is to look at the implementations of RDBPOT and SRDHM. Honestly, every time I see the code below I feel a vague sense of disgust; do you feel the same?

// Copyright 2015 The Gemmlowp Authors. All Rights Reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// <http://www.apache.org/licenses/LICENSE-2.0>
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

template <typename IntegerType, typename ExponentType>
inline IntegerType RoundingDivideByPOT(IntegerType x, ExponentType exponent) {
  assert(exponent >= 0);
  assert(exponent <= 31);
  const IntegerType mask = Dup<IntegerType>((1ll << exponent) - 1);
  const IntegerType zero = Dup<IntegerType>(0);
  const IntegerType one = Dup<IntegerType>(1);
  const IntegerType remainder = BitAnd(x, mask);
  const IntegerType threshold =
      Add(ShiftRight(mask, 1), BitAnd(MaskIfLessThan(x, zero), one));
  return Add(ShiftRight(x, exponent),
             BitAnd(MaskIfGreaterThan(remainder, threshold), one));
}

template <>
inline std::int32_t SaturatingRoundingDoublingHighMul(std::int32_t a,
                                                      std::int32_t b) {
  bool overflow = a == b && a == std::numeric_limits<std::int32_t>::min();
  std::int64_t a_64(a);
  std::int64_t b_64(b);
  std::int64_t ab_64 = a_64 * b_64;
  std::int32_t nudge = ab_64 >= 0 ? (1 << 30) : (1 - (1 << 30));
  std::int32_t ab_x2_high32 =
      static_cast<std::int32_t>((ab_64 + nudge) / (1ll << 31));
  return overflow ? std::numeric_limits<std::int32_t>::max() : ab_x2_high32;
}

Since I am not writing this article to torture myself, I will simply present the improved version below; spelling out the derivation step by step would be far too painful.

// Copyright (c) 2023-2024 Chung-Yi Chen (Yeecy)
// Faster Implementation of RDBPOT and SRDHM

// RDBPOT: divide x by 2^exp, rounding to nearest, ties away from zero.
inline int32_t RDBPOT(int32_t x, int32_t exp) {
  const int32_t mask = (1 << exp) - 1;
  const int32_t remainder = x & mask;
  // For negative x the rounding threshold is one higher.
  const int32_t threshold = (mask >> 1) + ((x >> 31) & 1);
  return (x >> exp) + (remainder > threshold);
}

// SRDHM: the high 32 bits of 2*a*b with rounding, saturating on the
// single overflowing input pair (a = b = INT32_MIN).
inline int32_t SRDHM(int32_t a, int32_t b) {
  const bool overflow = *(uint32_t*)&a == 0x80000000
                     && *(uint32_t*)&b == 0x80000000;
  const int64_t ab_64 = (int64_t)a * (int64_t)b;
  const int32_t nudge = (ab_64 >> 63) & 1 ? 0xc0000001 : 0x40000000;
  const int64_t ab_64_nudge = ab_64 + nudge;
  return overflow ? 0x7fffffff :
      (ab_64_nudge >> 63) & 1 ? -(-ab_64_nudge >> 31) : ab_64_nudge >> 31;
}

Next, we implement our own versions of the two MultiplyByQuantizedMultiplier* functions so that they use the improved RDBPOT and SRDHM above.

/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
Copyright 2023-2024 Chung-Yi Chen (Yeecy). All Rights Reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
=======================================================================*/

inline int32_t MultiplyByQuantizedMultiplier(int32_t x,
                                             int32_t quantized_multiplier,
                                             int shift) {
  int left_shift = shift > 0 ? shift : 0;
  int right_shift = shift > 0 ? 0 : -shift;
  return RDBPOT(SRDHM(x * (1 << left_shift), quantized_multiplier),
                right_shift);
}

inline int32_t MultiplyByQuantizedMultiplierSmallerThanOneExp(
    int32_t x, int32_t quantized_multiplier, int left_shift) {
  return RDBPOT(SRDHM(x, quantized_multiplier), -left_shift);
}

Improved RDBPOT and SRDHM: 1434397.59 µs, about a 1.008× speedup

If you understand what these four functions actually do, you should spot a few further improvements.

  • x * (1 << left_shift) is equivalent to x << left_shift
  • RDBPOT(y, 0) is equivalent to y
  • For MultiplyByQuantizedMultiplierSmallerThanOneExp, fold the minus sign into left_shift ahead of time, saving one negation

The improved code is shown below; in my tests, marking the else branch with [[unlikely]] performs slightly better.

// Copyright (c) 2023-2024 Chung-Yi Chen (Yeecy)
// Faster Implementation of MultiplyByQuantizedMultiplier* functions

inline int32_t MultiplyByQuantizedMultiplier(int32_t x,
                                             int32_t quantized_multiplier,
                                             int shift) {
  if (shift < 0) {
    return RDBPOT(SRDHM(x, quantized_multiplier), -shift);
  } else [[unlikely]] {
    return SRDHM(x << shift, quantized_multiplier);
  }
}

inline int32_t MultiplyByQuantizedMultiplierSmallerThanOneExp(
    int32_t x, int32_t quantized_multiplier, int left_shift) {
  return RDBPOT(SRDHM(x, quantized_multiplier), left_shift);
}

Improved MultiplyByQuantizedMultiplier*: 1432507.48 µs, about a 1.001× speedup

5.5 Unrolling FullyConnected

Finally, we unroll the innermost loop of FullyConnected 16 times to squeeze out one last bit of speedup.
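
A minimal sketch of the idea (not the exact code), building on the specialized FullyConnected from Section 5.1, whose accumulation depth of 64 is a multiple of 16:

for (int out_c = 0; out_c < 10; out_c++) {
  int32_t acc = bias_data[out_c];

  // 64 = 4 * 16, so four groups of 16 multiply-accumulates cover the
  // whole loop; the constant-bound inner loop is easy for the compiler
  // to unroll completely.
  for (int d = 0; d < 64; d += 16) {
    for (int u = 0; u < 16; u++) {
      acc += (input_data[d + u] + 128) * filter_data[out_c*64 + d + u];
    }
  }

  acc = MultiplyByQuantizedMultiplier(acc, 1552512742, -5);
  acc += 24;
  acc = std::max(acc, (int32_t)-128);
  acc = std::min(acc, (int32_t)127);
  output_data[out_c] = static_cast<int8_t>(acc);
}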

Unrolled FullyConnected: 1428413.2 µs, about a 1.002× speedup

Compared with the original inference time of 4044074.545 µs, all this purely software-level tinkering achieved a 2.831× speedup, with the matrix-multiplication improvements contributing the most. This also tells us that accelerating the most time-consuming computation first yields the most significant overall gain.
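
For reference, the full progression of the measurements above:

Step                                      Time (µs)      Step speedup
Clean version                             4044074.545    -
Parameter specialization                  2366758.44     1.708×
Branch hints                              2333546.375    1.014×
Range-check improvement                   2297518.38     1.015×
Im2col and GEMM                           2273045.74     1.010×
Im2col and tiled GEMM                     2209699.635    1.028×
Loop unrolling                            1446542.495    1.527×
Improved RDBPOT and SRDHM                 1434397.59     1.008×
Improved MultiplyByQuantizedMultiplier*   1432507.48     1.001×
Unrolled FullyConnected                   1428413.2      1.002×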
