Arm Mali Bifrost and Valhall OpenCL Developer Guide - Chapter 8 - Optimizing OpenCL for Mali GPUs

Arm Mali Bifrost and Valhall OpenCL Developer Guide - Version 4.0
https://developer.arm/documentation/101574/0400

8. Optimizing OpenCL for Mali GPUs

This chapter describes the procedure to optimize OpenCL applications for Mali GPUs.

It contains the following sections:

  • 8.1 The optimization process for OpenCL applications.
  • 8.2 Load balancing between control threads and OpenCL threads.
  • 8.3 Optimizing memory allocation.

8.1 The optimization process for OpenCL applications

To optimize your application, you must first identify its most computationally intensive parts. In an OpenCL application, that means identifying the kernels that take the most time.

To identify the most computationally intensive kernels, you must measure the time taken by each kernel individually:

  • Measure individual kernels

Go through your kernels one at a time and:

  1. Measure the time it takes for several runs.
  2. Average the results.

Measure the runtime of each kernel individually to get accurate measurements.

Do a dummy run of the kernel first to ensure that the memory is allocated. Keep this run outside your timing loop.

In certain cases, the allocation of some buffers is delayed until the first time they are used. This can make the first kernel run slower than subsequent runs.
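The measurement steps above can be sketched in plain C as follows. This is a minimal sketch, not an Arm-provided utility: run_kernel_fn, average_kernel_time_ms() and count_call() are hypothetical names, and the callback is assumed to return only after the kernel has finished (for example, by enqueuing the kernel and waiting on its event). It uses wall-clock time from the C11 timespec_get() function.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical callback type: runs the kernel once and returns only after it
 * has finished, for example by enqueuing it and waiting on its event. */
typedef void (*run_kernel_fn)(void *user_data);

/* Milliseconds elapsed between two wall-clock timestamps. */
double elapsed_ms(struct timespec start, struct timespec end)
{
    return (double)(end.tv_sec - start.tv_sec) * 1e3 +
           (double)(end.tv_nsec - start.tv_nsec) / 1e6;
}

/* Returns the mean wall-clock time of `runs` executions of one kernel. */
double average_kernel_time_ms(run_kernel_fn run, void *user_data, int runs)
{
    struct timespec start, end;

    assert(run != NULL && runs > 0);

    /* Dummy run, outside the timing loop, so that any delayed buffer
     * allocation does not inflate the measurement. */
    run(user_data);

    timespec_get(&start, TIME_UTC);
    for (int i = 0; i < runs; ++i) {
        run(user_data);
    }
    timespec_get(&end, TIME_UTC);

    /* Average over the timed runs only. */
    return elapsed_ms(start, end) / runs;
}

/* Stand-in for a real kernel launch: it only counts how often it is called. */
void count_call(void *user_data)
{
    ++*(int *)user_data;
}
```

Calling average_kernel_time_ms(count_call, &calls, 5) invokes the callback six times: one untimed dummy run and five timed runs. In a real application you would replace count_call() with a function that launches one kernel. OpenCL event profiling is an alternative source of timestamps if you prefer device-side times.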

  • Select the kernels that take the most time

Select the kernels that have the longest runtime and optimize these. Optimizing any other kernels has little impact on overall performance.

  • Analyze the kernels

Analyze the kernels to see if they contain computationally expensive operations:

  1. Measure how many reads and writes there are in the kernel. For high performance, do as many computations per memory access as possible.
  2. For Mali GPUs, you can use the Offline Shader Compiler to check the balancing between the different units.

  • Measure individual parts of the kernel

If you cannot determine the compute-intensive part of the kernel by analysis, you can isolate it by measuring different parts of the kernel individually.

You can do this by removing different code blocks and measuring the performance difference each time.

The section of code that takes the most time is the most intensive.
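The comparison described above is simple arithmetic: the cost attributed to a code block is the runtime of the full kernel minus the runtime with that block removed. The following sketch shows one way to pick the most expensive block from such measurements. The function name and the sample numbers in the usage note are illustrative assumptions, not measured data.

```c
#include <assert.h>
#include <stddef.h>

/* full_ms:       runtime of the unmodified kernel.
 * without_ms[i]: runtime of the kernel with code block i removed.
 * Returns the index of the block whose removal saves the most time. */
size_t most_expensive_block(double full_ms, const double *without_ms,
                            size_t count)
{
    assert(without_ms != NULL && count > 0);

    size_t best = 0;
    double best_saving = full_ms - without_ms[0];

    for (size_t i = 1; i < count; ++i) {
        double saving = full_ms - without_ms[i];
        if (saving > best_saving) {
            best_saving = saving;
            best = i;
        }
    }
    return best;
}
```

For example, with a full runtime of 10.0 ms and runtimes of 9.0, 4.5, and 8.0 ms after removing blocks 0, 1, and 2, the function returns 1. Treat the differences as estimates: removing a block can also let the compiler eliminate other code that depended on it.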

  • Apply optimizations

Consider how the most intensive section of code can be rewritten and what optimizations apply.

Apply a relevant optimization.

  • Check your results

Whenever you make changes to optimize your code, measure the results so that you can determine whether the optimization was successful. Many changes that are beneficial in one situation might provide no benefit, or even reduce performance, under a different set of conditions.

  • Reiterate the process

When you have increased the performance of your code with an optimization, measure it again to find out if there are other areas where you can improve performance. There are typically several such areas, so you might need to iterate the process many times to achieve optimal performance.

8.2 Load balancing between control threads and OpenCL threads

If you can, ensure that both control threads and OpenCL threads run in parallel.

This section contains the following subsections:

  • 8.2.1 Do not use clFinish() for synchronization.
  • 8.2.2 Do not use any of the clEnqueueMap() operations with a blocking call.

8.2.1 Do not use clFinish() for synchronization

Sometimes the application processor must access data written by OpenCL. This process must be synchronized.

You can perform the synchronization with clFinish(), but Arm recommends that you avoid this if possible because it serializes execution. Calls to clFinish() introduce delays because the control thread must wait until all of the jobs in the queue complete execution. The control thread is idle while it waits.

Instead, where possible, use clWaitForEvents() or callbacks to ensure that the control thread and OpenCL can work in parallel.

8.2.2 Do not use any of the clEnqueueMap() operations with a blocking call

Use clWaitForEvents() or callbacks to ensure that the control thread and OpenCL can work in parallel.

Procedure

  1. Split work into many parts.
  2. For each part:
    a. Prepare the work for part X on the application processor.
    b. Submit part X OpenCL work-items to the OpenCL device.
  3. For each part:
    a. Wait for part X OpenCL work-items to complete on the OpenCL device using clWaitForEvents().
    b. Process the results from the OpenCL device on the application processor.
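The control flow of this procedure can be sketched in plain C as two loops. The part_hooks structure, run_in_parts(), and the tracing hooks are hypothetical names introduced for illustration. In a real application, the submit hook would enqueue the work without blocking and keep the resulting event, and the wait hook would call clWaitForEvents() on that event.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical application hooks, one per step of the procedure. */
typedef struct {
    void (*prepare)(size_t part, void *ctx); /* 2a: prepare on the application processor */
    void (*submit)(size_t part, void *ctx);  /* 2b: enqueue without blocking */
    void (*wait)(size_t part, void *ctx);    /* 3a: wait for this part only */
    void (*process)(size_t part, void *ctx); /* 3b: consume the results */
} part_hooks;

void run_in_parts(const part_hooks *hooks, size_t parts, void *ctx)
{
    assert(hooks != NULL);

    /* Step 2: while the device works on earlier parts, the application
     * processor is already preparing the next one. */
    for (size_t p = 0; p < parts; ++p) {
        hooks->prepare(p, ctx);
        hooks->submit(p, ctx);
    }

    /* Step 3: wait per part, so results are processed while later parts
     * are still running on the device. */
    for (size_t p = 0; p < parts; ++p) {
        hooks->wait(p, ctx);
        hooks->process(p, ctx);
    }
}

/* Tracing hooks that only record the order of the calls. */
typedef struct {
    char log[32];
    size_t len;
} trace;

static void record(trace *t, char step)
{
    if (t->len + 1 < sizeof t->log) {
        t->log[t->len++] = step;
        t->log[t->len] = '\0';
    }
}

void trace_prepare(size_t part, void *ctx) { (void)part; record(ctx, 'P'); }
void trace_submit(size_t part, void *ctx)  { (void)part; record(ctx, 'S'); }
void trace_wait(size_t part, void *ctx)    { (void)part; record(ctx, 'W'); }
void trace_process(size_t part, void *ctx) { (void)part; record(ctx, 'R'); }

const part_hooks trace_hooks = {
    trace_prepare, trace_submit, trace_wait, trace_process
};
```

With two parts, the tracing hooks record the order PSPSWRWR: every part is prepared and submitted before the first wait. Waiting on individual parts, rather than on the whole queue, is what lets the control thread start processing early results while the device is still busy.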

8.3 Optimizing memory allocation

You can optimize memory allocation by using the correct commands.

This section contains the following subsections:

  • 8.3.1 About memory allocation.
  • 8.3.2 Use CL_MEM_ALLOC_HOST_PTR to avoid copying memory.
  • 8.3.3 Do not create buffers with CL_MEM_USE_HOST_PTR if possible.
  • 8.3.4 Sharing memory between I/O devices and OpenCL.
  • 8.3.5 Sharing memory in a fully coherent system.
  • 8.3.6 Sharing memory in an I/O coherent system.

8.3.1 About memory allocation

To avoid making copies, use the OpenCL API to allocate memory buffers and use map and unmap operations. These operations enable both the application processor and the Mali GPU to access the data without any copies.

OpenCL originated in desktop systems where the application processor and the GPU have separate memories. To use OpenCL in these systems, you must allocate buffers to copy data to and from the separate memories.

Systems with Mali GPUs typically have a shared memory, so you are not required to copy data. However, OpenCL assumes that the memories are separate, and buffer allocation involves memory copies. This is wasteful because copies take time and consume power.

The following table shows the different cl_mem_flags parameters in clCreateBuffer().

Table 8-1 Parameters for clCreateBuffer()

Parameter - Description

CL_MEM_ALLOC_HOST_PTR:
This is a hint to the driver indicating that the buffer is accessed on the host side. To use the buffer on the application processor side, you must map this buffer and write the data into it. This is the only method that does not involve copying data. If you must fill in an image that is processed by the GPU, this is the best way to avoid a copy.

CL_MEM_COPY_HOST_PTR:
Copies the data pointed to by the host_ptr argument into memory allocated by the driver.

CL_MEM_USE_HOST_PTR:
Copies the data pointed to by the host memory pointer into the buffer when the first kernel using this buffer starts running. This flag enforces memory restrictions that can reduce performance. Avoid using this if possible. When a map is executed, the memory must be copied back to the provided host pointer. This significantly increases the cost of map operations.

Arm recommends the following:

  1. Do not use private or local memory to improve memory read performance.
  2. If your kernel is memory bandwidth bound, try using a simple formula to compute variables instead of reading from memory. This saves memory bandwidth and might be faster.
  3. If your kernel is compute bound, try reading from memory instead of computing variables. This saves computations and might be faster.
  4. If you are using a Mali Bifrost or Valhall GPU in a fully coherent system, use fine-grain shared virtual memory. See Shared virtual memory.
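Recommendations 2 and 3 describe the same trade-off from opposite directions. The following sketch illustrates it in plain C on the host to keep the example self-contained. The function names and the linear-ramp weights are assumptions chosen for illustration. Inside an OpenCL kernel the same idea applies, with the work-item index in place of the loop counter.

```c
#include <assert.h>
#include <stddef.h>

/* Table variant: one extra memory read per element to fetch the weight.
 * This might suit a compute-bound kernel with expensive weights. */
void scale_with_table(const float *in, const float *weights, float *out,
                      size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        out[i] = in[i] * weights[i];
    }
}

/* Formula variant: the weights form a linear ramp (weight = i * step), so
 * they are computed instead of read. This removes one memory read per
 * element, which might help a bandwidth-bound kernel. */
void scale_with_formula(const float *in, float step, float *out, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        out[i] = in[i] * ((float)i * step);
    }
}
```

Both functions produce the same output when the table holds the ramp 0, step, 2 * step, and so on. Which variant is faster depends on whether the kernel is limited by bandwidth or by computation, so measure both.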

8.3.2 Use CL_MEM_ALLOC_HOST_PTR to avoid copying memory

The Mali GPU can access the memory buffers created by clCreateBuffer(CL_MEM_ALLOC_HOST_PTR). This is the preferred method to allocate buffers because data copies are not required.

This method of allocating buffers is shown in the following figure.


Figure 8-1 Memory buffer created by clCreateBuffer(CL_MEM_ALLOC_HOST_PTR)

Arm recommends the following:

  1. You must make the initial memory allocation through the OpenCL API.
  2. Always use the latest pointer returned. If a buffer is repeatedly mapped and unmapped, the address the buffer maps into is not guaranteed to be the same.
  3. If you are using a Mali Bifrost or Valhall GPU in a fully coherent system, use fine-grain shared virtual memory. See Shared virtual memory.

8.3.3 Do not create buffers with CL_MEM_USE_HOST_PTR if possible

When a memory buffer is created using clCreateBuffer(CL_MEM_USE_HOST_PTR), the driver might be required to copy the data to a separate buffer. This copy enables a kernel running on the GPU to access it. If the kernel modifies the buffer and the application maps the buffer so that it can be read, the driver copies the updated data back to the original location. The driver uses the application processor to perform these copy operations, which are computationally expensive.

This method of allocating buffers is shown in the following figure.


Figure 8-2 Memory buffer created by clCreateBuffer(CL_MEM_USE_HOST_PTR)

If your application can use an alternative allocation type, such as CL_MEM_ALLOC_HOST_PTR, it can avoid these computationally expensive copy operations.

8.3.4 Sharing memory between I/O devices and OpenCL

For an I/O device to share memory with OpenCL, you must allocate the memory in OpenCL with CL_MEM_ALLOC_HOST_PTR. This flag ensures that the memory pages are always mapped into physical memory.

If you allocate the memory on the application processor, the OS might not allocate physical memory to the pages until they are used for the first time. Errors occur if an I/O device attempts to use unmapped pages.

8.3.5 Sharing memory in a fully coherent system

Systems with full system coherency enable application processors and GPUs to share data easily, increasing performance.

With full system coherency, application processors and GPUs can access memory without requiring cache clean or invalidate operations on memory objects. This provides better performance than an I/O coherent system when the data is shared between the application processor and the GPU.

Fully coherent systems with Mali Bifrost or Valhall GPUs support fine-grained shared virtual memory in OpenCL 2.0 or later. See Shared virtual memory.

8.3.6 Sharing memory in an I/O coherent system

With I/O coherent allocation, the driver is not required to perform cache clean or invalidate operations on memory objects before or after they are used on the Mali GPU. If you are using a memory object on both the application processor and the Mali GPU, this can improve performance.

If your platform is I/O coherent, you can enable I/O coherent memory allocation by passing the CL_MEM_ALLOC_HOST_PTR flag to clCreateBuffer() or clCreateImage().

If you are using OpenCL 2.0 or later and your platform is I/O coherent, use shared virtual memory. See Shared virtual memory.
