
[Altera SoC Experience Journey] + Officially Switching to OpenCL Mode

Viewed 5333 times | 2015-02-06 23:37 | Personal category: Altera SoC Deep Experience Journey

Things have taken quite a few turns recently. The earlier Lark board looked high-end, but its documentation was so scarce that building every module from scratch was unrealistic for my application.
After talking with EEWorld moderator 影子, and with her help, I swapped boards with forum member @chenzhufly, who had the Arrow SoC board. That one is somewhat better documented; at the very least, RocketBoards hosts plenty of tutorials and materials for it.
Everything looked perfect, but after finishing all the experiments I discovered that Altera's promised "OpenCL development support" was, for this board, just a slogan: I searched the entire official site and found no BSP for it. I asked Arrow employee @Alex and was told there is no BSP yet.
So I had no choice but to swap again, this time for a board that actually supports OpenCL development: Terasic's DE1-SoC, the best value for money of the lot. The person I traded with was @coyoo (author of "In-depth Understanding of Altera FPGA Application Design"). This forum really is full of hidden talent.

I feel lucky to be in this competition: I have now gotten hands-on time with three different boards (out of only four in total) and met a group of serious technical experts. A great deal all around.

Some of you will surely ask: what exactly are you building that absolutely requires OpenCL?

More than one person has asked. When I first saw this competition I wasn't going to sign up: I'm no longer a student and don't have much spare time for contests. But right then I came across the deep-neural-network accelerator (DNN on FPGA) that Altera and Baidu jointly presented at a major computing conference, and I happened to have my own idea of implementing a convolutional neural network on an FPGA (related to my work). With all those coincidences lining up, I signed up without hesitation.

What is a neural network good for? It mimics the organization of the human brain: large numbers of neurons pass messages to one another to implement cognition. The simplest example is object recognition. When a person sees a table, they know it's a table and not a stool, because it matches the learned features of "table"; through extensive training, the brain has encoded those features in the weights between neurons. To a computer, a table seen through a camera is just a pile of pixel values (RGB). Shallow processing such as median filtering, correlation, or Sobel filtering cannot recognize the concept "table"; it merely presents one dimension of the information to the user and leaves the judgment to them. To organize the information effectively, you build a large number of identical neurons, each performing one elementary operation (accumulate its inputs and, when a condition is met, pass its output on to the next neuron). Layer upon layer, this builds up deep cognition, until the neurons at the very end can directly answer "this is a table", "this is a stool", or "this is a chair".
A convolutional neural network adds an approximation on top of this: neurons within the same layer share their weights, which reduces the number of connections and makes the network much easier to implement on a computer.

After all that, it boils down to one sentence: my current algorithms are implemented in C/C++ and CUDA, and OpenCL is the fastest way to port them to an FPGA, which makes it the most important part of this evaluation. (My previous FPGA work was all in VHDL/Verilog, where design + simulation + debugging takes far too much time to finish on a short schedule. Right now I only care about the algorithm, not the low-level implementation; once the basic functionality works, this phase is done, and resources, timing, and performance can be optimized afterwards.)

After receiving the board, I read the official documentation carefully and set up the OpenCL environment.

For time reasons I won't go into OpenCL syntax and program structure today; let's jump straight to the examples.

Flash the TF card following the steps in my earlier post. When it's done, set the SW10 DIP switch to "01010" (this is important: if the FPGA is not configured, the script below will lock up), then power on.

On the PC, open PuTTY at 115200 baud and log in as root (no password) to enter the system.

You can see the system is Poky 8.0 (Yocto Project 1.3 Reference Distro) 1.3 socfpga ttyS0, the same default system as on the earlier Lark board.
An ls shows plenty of example programs in the current directory.
First, a warm-up: run the script that initializes the OpenCL environment:
source ./init_opencl.sh
It finishes quickly. Let's open the script and see what's inside:
  root@socfpga:~/vector_Add# cat ~/init_opencl.sh
  export ALTERAOCLSDKROOT=/home/root/opencl_arm32_rte
  export AOCL_BOARD_PACKAGE_ROOT=$ALTERAOCLSDKROOT/board/c5soc
  export PATH=$ALTERAOCLSDKROOT/bin:$PATH
  export LD_LIBRARY_PATH=$ALTERAOCLSDKROOT/host/arm32/lib:$LD_LIBRARY_PATH
  insmod $AOCL_BOARD_PACKAGE_ROOT/driver/aclsoc_drv.ko
It first sets a few environment variables:
ALTERAOCLSDKROOT
AOCL_BOARD_PACKAGE_ROOT
PATH
LD_LIBRARY_PATH
and then runs insmod to load the driver.
So the OpenCL service is provided by the kernel module $AOCL_BOARD_PACKAGE_ROOT/driver/aclsoc_drv.ko.
OK, everything is ready. First, enter the helloworld directory:
  root@socfpga:~# cd helloworld/
  root@socfpga:~/helloworld# ls
  hello_world.aocx  helloworld
This directory contains two files, hello_world.aocx and helloworld. The former runs on the FPGA (in OpenCL terms, the kernel); the latter runs on the ARM (the host program). The two are built by separate flows: the kernel with Altera's offline OpenCL compiler, the host program with the ARM cross-compiler.

The steps to run it are:
  root@socfpga:~/helloworld# aocl program /dev/acl0 hello_world.aocx
  aocl program: Running reprogram from /home/root/opencl_arm32_rte/board/c5soc/arm32/bin
  Reprogramming was successful!
As you can see, it ran successfully.
The source code can be found in DE1-SoC_openCL_BSP.zip, under examples/helloworld/.
Files with the .cl suffix are kernels. The kernel for this example is:
  // AOC kernel demonstrating device-side printf call
  __kernel void hello_world(int thread_id_from_which_to_print_message) {
    // Get index of the work item
    unsigned thread_id = get_global_id(0);

    if(thread_id == thread_id_from_which_to_print_message) {
      printf("Thread #%u: Hello from Altera's OpenCL Compiler!\n", thread_id);
    }
  }
It looks just like a C function, except for the "__kernel" prefix, which marks it as running on the device (the FPGA). Altera's OpenCL tools compile it into an FPGA bitstream configuration file.
The function itself is very simple: each work item checks whether its own thread ID matches the one specified by the host; if it matches, it prints a message, and all other threads stay silent.
Next, let's see what the host program looks like:
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <math.h>
  #include <assert.h>
  #include "CL/opencl.h"
  #include "AOCL_Utils.h"

  using namespace aocl_utils;

  #define STRING_BUFFER_LEN 1024

  // Runtime constants
  // Used to define the work set over which this kernel will execute.
  static const size_t work_group_size = 8;  // 8 threads in the demo workgroup
  // Defines kernel argument value, which is the workitem ID that will
  // execute a printf call
  static const int thread_id_to_output = 2;

  // OpenCL runtime configuration
  static cl_platform_id platform = NULL;
  static cl_device_id device = NULL;
  static cl_context context = NULL;
  static cl_command_queue queue = NULL;
  static cl_kernel kernel = NULL;
  static cl_program program = NULL;

  // Function prototypes
  bool init();
  void cleanup();
  static void device_info_ulong( cl_device_id device, cl_device_info param, const char* name);
  static void device_info_uint( cl_device_id device, cl_device_info param, const char* name);
  static void device_info_bool( cl_device_id device, cl_device_info param, const char* name);
  static void device_info_string( cl_device_id device, cl_device_info param, const char* name);
  static void display_device_info( cl_device_id device );

  // Entry point.
  int main() {
    cl_int status;

    if(!init()) {
      return -1;
    }

    // Set the kernel argument (argument 0)
    status = clSetKernelArg(kernel, 0, sizeof(cl_int), (void*)&thread_id_to_output);
    checkError(status, "Failed to set kernel arg 0");

    printf("\nKernel initialization is complete.\n");
    printf("Launching the kernel...\n\n");

    // Configure work set over which the kernel will execute
    size_t wgSize[3] = {work_group_size, 1, 1};
    size_t gSize[3] = {work_group_size, 1, 1};

    // Launch the kernel
    status = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, gSize, wgSize, 0, NULL, NULL);
    checkError(status, "Failed to launch kernel");

    // Wait for command queue to complete pending events
    status = clFinish(queue);
    checkError(status, "Failed to finish");

    printf("\nKernel execution is complete.\n");

    // Free the resources allocated
    cleanup();

    return 0;
  }

  /////// HELPER FUNCTIONS ///////

  bool init() {
    cl_int status;

    if(!setCwdToExeDir()) {
      return false;
    }

    // Get the OpenCL platform.
    platform = findPlatform("Altera");
    if(platform == NULL) {
      printf("ERROR: Unable to find Altera OpenCL platform.\n");
      return false;
    }

    // User-visible output - Platform information
    {
      char char_buffer[STRING_BUFFER_LEN];
      printf("Querying platform for info:\n");
      printf("==========================\n");
      clGetPlatformInfo(platform, CL_PLATFORM_NAME, STRING_BUFFER_LEN, char_buffer, NULL);
      printf("%-40s = %s\n", "CL_PLATFORM_NAME", char_buffer);
      clGetPlatformInfo(platform, CL_PLATFORM_VENDOR, STRING_BUFFER_LEN, char_buffer, NULL);
      printf("%-40s = %s\n", "CL_PLATFORM_VENDOR ", char_buffer);
      clGetPlatformInfo(platform, CL_PLATFORM_VERSION, STRING_BUFFER_LEN, char_buffer, NULL);
      printf("%-40s = %s\n\n", "CL_PLATFORM_VERSION ", char_buffer);
    }

    // Query the available OpenCL devices.
    scoped_array<cl_device_id> devices;
    cl_uint num_devices;

    devices.reset(getDevices(platform, CL_DEVICE_TYPE_ALL, &num_devices));

    // We'll just use the first device.
    device = devices[0];

    // Display some device information.
    display_device_info(device);

    // Create the context.
    context = clCreateContext(NULL, 1, &device, NULL, NULL, &status);
    checkError(status, "Failed to create context");

    // Create the command queue.
    queue = clCreateCommandQueue(context, device, CL_QUEUE_PROFILING_ENABLE, &status);
    checkError(status, "Failed to create command queue");

    // Create the program.
    std::string binary_file = getBoardBinaryFile("hello_world", device);
    printf("Using AOCX: %s\n", binary_file.c_str());
    program = createProgramFromBinary(context, binary_file.c_str(), &device, 1);

    // Build the program that was just created.
    status = clBuildProgram(program, 0, NULL, "", NULL, NULL);
    checkError(status, "Failed to build program");

    // Create the kernel - name passed in here must match kernel name in the
    // original CL file, that was compiled into an AOCX file using the AOC tool
    const char *kernel_name = "hello_world";  // Kernel name, as defined in the CL file
    kernel = clCreateKernel(program, kernel_name, &status);
    checkError(status, "Failed to create kernel");

    return true;
  }

  // Free the resources allocated during initialization
  void cleanup() {
    if(kernel) {
      clReleaseKernel(kernel);
    }
    if(program) {
      clReleaseProgram(program);
    }
    if(queue) {
      clReleaseCommandQueue(queue);
    }
    if(context) {
      clReleaseContext(context);
    }
  }

  // Helper functions to display parameters returned by OpenCL queries
  static void device_info_ulong( cl_device_id device, cl_device_info param, const char* name) {
     cl_ulong a;
     clGetDeviceInfo(device, param, sizeof(cl_ulong), &a, NULL);
     printf("%-40s = %lu\n", name, a);
  }
  static void device_info_uint( cl_device_id device, cl_device_info param, const char* name) {
     cl_uint a;
     clGetDeviceInfo(device, param, sizeof(cl_uint), &a, NULL);
     printf("%-40s = %u\n", name, a);
  }
  static void device_info_bool( cl_device_id device, cl_device_info param, const char* name) {
     cl_bool a;
     clGetDeviceInfo(device, param, sizeof(cl_bool), &a, NULL);
     printf("%-40s = %s\n", name, (a?"true":"false"));
  }
  static void device_info_string( cl_device_id device, cl_device_info param, const char* name) {
     char a[STRING_BUFFER_LEN];
     clGetDeviceInfo(device, param, STRING_BUFFER_LEN, &a, NULL);
     printf("%-40s = %s\n", name, a);
  }

  // Query and display OpenCL information on device and runtime environment
  static void display_device_info( cl_device_id device ) {

     printf("Querying device for info:\n");
     printf("========================\n");
     device_info_string(device, CL_DEVICE_NAME, "CL_DEVICE_NAME");
     device_info_string(device, CL_DEVICE_VENDOR, "CL_DEVICE_VENDOR");
     device_info_uint(device, CL_DEVICE_VENDOR_ID, "CL_DEVICE_VENDOR_ID");
     device_info_string(device, CL_DEVICE_VERSION, "CL_DEVICE_VERSION");
     device_info_string(device, CL_DRIVER_VERSION, "CL_DRIVER_VERSION");
     device_info_uint(device, CL_DEVICE_ADDRESS_BITS, "CL_DEVICE_ADDRESS_BITS");
     device_info_bool(device, CL_DEVICE_AVAILABLE, "CL_DEVICE_AVAILABLE");
     device_info_bool(device, CL_DEVICE_ENDIAN_LITTLE, "CL_DEVICE_ENDIAN_LITTLE");
     device_info_ulong(device, CL_DEVICE_GLOBAL_MEM_CACHE_SIZE, "CL_DEVICE_GLOBAL_MEM_CACHE_SIZE");
     device_info_ulong(device, CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE, "CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE");
     device_info_ulong(device, CL_DEVICE_GLOBAL_MEM_SIZE, "CL_DEVICE_GLOBAL_MEM_SIZE");
     device_info_bool(device, CL_DEVICE_IMAGE_SUPPORT, "CL_DEVICE_IMAGE_SUPPORT");
     device_info_ulong(device, CL_DEVICE_LOCAL_MEM_SIZE, "CL_DEVICE_LOCAL_MEM_SIZE");
     device_info_ulong(device, CL_DEVICE_MAX_CLOCK_FREQUENCY, "CL_DEVICE_MAX_CLOCK_FREQUENCY");
     device_info_ulong(device, CL_DEVICE_MAX_COMPUTE_UNITS, "CL_DEVICE_MAX_COMPUTE_UNITS");
     device_info_ulong(device, CL_DEVICE_MAX_CONSTANT_ARGS, "CL_DEVICE_MAX_CONSTANT_ARGS");
     device_info_ulong(device, CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE, "CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE");
     device_info_uint(device, CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS, "CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS");
     device_info_uint(device, CL_DEVICE_MEM_BASE_ADDR_ALIGN, "CL_DEVICE_MEM_BASE_ADDR_ALIGN");
     device_info_uint(device, CL_DEVICE_MIN_DATA_TYPE_ALIGN_SIZE, "CL_DEVICE_MIN_DATA_TYPE_ALIGN_SIZE");
     device_info_uint(device, CL_DEVICE_PREFERRED_VECTOR_WIDTH_CHAR, "CL_DEVICE_PREFERRED_VECTOR_WIDTH_CHAR");
     device_info_uint(device, CL_DEVICE_PREFERRED_VECTOR_WIDTH_SHORT, "CL_DEVICE_PREFERRED_VECTOR_WIDTH_SHORT");
     device_info_uint(device, CL_DEVICE_PREFERRED_VECTOR_WIDTH_INT, "CL_DEVICE_PREFERRED_VECTOR_WIDTH_INT");
     device_info_uint(device, CL_DEVICE_PREFERRED_VECTOR_WIDTH_LONG, "CL_DEVICE_PREFERRED_VECTOR_WIDTH_LONG");
     device_info_uint(device, CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT, "CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT");
     device_info_uint(device, CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE, "CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE");

     {
        cl_command_queue_properties ccp;
        clGetDeviceInfo(device, CL_DEVICE_QUEUE_PROPERTIES, sizeof(cl_command_queue_properties), &ccp, NULL);
        printf("%-40s = %s\n", "Command queue out of order? ", ((ccp & CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE)?"true":"false"));
        printf("%-40s = %s\n", "Command queue profiling enabled? ", ((ccp & CL_QUEUE_PROFILING_ENABLE)?"true":"false"));
     }
  }
The host program is long, but the main flow is: initialize the platform, find the device, print device information, create a device context, create a command queue in that context, load the device code, build it, create the kernel object, set the kernel arguments, launch the kernel, wait for it to finish, and release all objects.
This is the most basic OpenCL flow. It looks tedious, but once you are familiar with it the steps are almost always the same and the code barely changes between projects; the part that truly deserves careful design is the kernel.

All right, one more example before bed.
Go up one directory, then into vector_Add and run it:

  root@socfpga:~/helloworld# cd ..
  root@socfpga:~# ls
  README            helloworld        opencl_arm32_rte  vector_Add
  boardtest         init_opencl.sh    swapper
  root@socfpga:~# cd vector_Add/
  root@socfpga:~/vector_Add# ls
  vectorAdd       vectorAdd.aocx
  root@socfpga:~/vector_Add# aocl program /dev/acl0 vectorAdd.aocx
  aocl program: Running reprogram from /home/root/opencl_arm32_rte/board/c5soc/arm32/bin
  Reprogramming was successful!
  root@socfpga:~/vector_Add# ./vectorAdd
  Initializing OpenCL
  Platform: Altera SDK for OpenCL
  Using 1 device(s)
    de1soc_sharedonly : Cyclone V SoC Development Kit
  Using AOCX: vectorAdd.aocx
  Launching for device 0 (1000000 elements)

  Time: 107.127 ms
  Kernel time (device 0): 6.933 ms

  Verification: PASS


This is vector addition, a classic parallel-computing example. The kernel is:
  __kernel void vectorAdd(__global const float *x,
                          __global const float *y,
                          __global float *restrict z)
  {
      // get index of the work item
      int index = get_global_id(0);

      // add the vector elements
      z[index] = x[index] + y[index];
  }
The host program is as follows:
  #include <stdio.h>
  #include <stdlib.h>
  #include <math.h>
  #include "CL/opencl.h"
  #include "AOCL_Utils.h"

  using namespace aocl_utils;

  // OpenCL runtime configuration
  cl_platform_id platform = NULL;
  unsigned num_devices = 0;
  scoped_array<cl_device_id> device; // num_devices elements
  cl_context context = NULL;
  scoped_array<cl_command_queue> queue; // num_devices elements
  cl_program program = NULL;
  scoped_array<cl_kernel> kernel; // num_devices elements
  scoped_array<cl_mem> input_a_buf; // num_devices elements
  scoped_array<cl_mem> input_b_buf; // num_devices elements
  scoped_array<cl_mem> output_buf; // num_devices elements

  // Problem data.
  const unsigned N = 1000000; // problem size
  scoped_array<scoped_aligned_ptr<float> > input_a, input_b; // num_devices elements
  scoped_array<scoped_aligned_ptr<float> > output; // num_devices elements
  scoped_array<scoped_aligned_ptr<float> > ref_output; // num_devices elements
  scoped_array<unsigned> n_per_device; // num_devices elements

  // Function prototypes
  float rand_float();
  bool init_opencl();
  void init_problem();
  void run();
  void cleanup();

  // Entry point.
  int main() {
    // Initialize OpenCL.
    if(!init_opencl()) {
      return -1;
    }

    // Initialize the problem data.
    // Requires the number of devices to be known.
    init_problem();

    // Run the kernel.
    run();

    // Free the resources allocated
    cleanup();

    return 0;
  }

  /////// HELPER FUNCTIONS ///////

  // Randomly generate a floating-point number between -10 and 10.
  float rand_float() {
    return float(rand()) / float(RAND_MAX) * 20.0f - 10.0f;
  }

  // Initializes the OpenCL objects.
  bool init_opencl() {
    cl_int status;

    printf("Initializing OpenCL\n");

    if(!setCwdToExeDir()) {
      return false;
    }

    // Get the OpenCL platform.
    platform = findPlatform("Altera");
    if(platform == NULL) {
      printf("ERROR: Unable to find Altera OpenCL platform.\n");
      return false;
    }

    // Query the available OpenCL devices.
    device.reset(getDevices(platform, CL_DEVICE_TYPE_ALL, &num_devices));
    printf("Platform: %s\n", getPlatformName(platform).c_str());
    printf("Using %d device(s)\n", num_devices);
    for(unsigned i = 0; i < num_devices; ++i) {
      printf("  %s\n", getDeviceName(device[i]).c_str());
    }

    // Create the context.
    context = clCreateContext(NULL, num_devices, device, NULL, NULL, &status);
    checkError(status, "Failed to create context");

    // Create the program for all devices. Use the first device as the
    // representative device (assuming all devices are of the same type).
    std::string binary_file = getBoardBinaryFile("vectorAdd", device[0]);
    printf("Using AOCX: %s\n", binary_file.c_str());
    program = createProgramFromBinary(context, binary_file.c_str(), device, num_devices);

    // Build the program that was just created.
    status = clBuildProgram(program, 0, NULL, "", NULL, NULL);
    checkError(status, "Failed to build program");

    // Create per-device objects.
    queue.reset(num_devices);
    kernel.reset(num_devices);
    n_per_device.reset(num_devices);
    input_a_buf.reset(num_devices);
    input_b_buf.reset(num_devices);
    output_buf.reset(num_devices);

    for(unsigned i = 0; i < num_devices; ++i) {
      // Command queue.
      queue[i] = clCreateCommandQueue(context, device[i], CL_QUEUE_PROFILING_ENABLE, &status);
      checkError(status, "Failed to create command queue");

      // Kernel.
      const char *kernel_name = "vectorAdd";
      kernel[i] = clCreateKernel(program, kernel_name, &status);
      checkError(status, "Failed to create kernel");

      // Determine the number of elements processed by this device.
      n_per_device[i] = N / num_devices; // number of elements handled by this device

      // Spread out the remainder of the elements over the first
      // N % num_devices.
      if(i < (N % num_devices)) {
        n_per_device[i]++;
      }

      // Input buffers.
      input_a_buf[i] = clCreateBuffer(context, CL_MEM_READ_ONLY,
          n_per_device[i] * sizeof(float), NULL, &status);
      checkError(status, "Failed to create buffer for input A");

      input_b_buf[i] = clCreateBuffer(context, CL_MEM_READ_ONLY,
          n_per_device[i] * sizeof(float), NULL, &status);
      checkError(status, "Failed to create buffer for input B");

      // Output buffer.
      output_buf[i] = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
          n_per_device[i] * sizeof(float), NULL, &status);
      checkError(status, "Failed to create buffer for output");
    }

    return true;
  }

  // Initialize the data for the problem. Requires num_devices to be known.
  void init_problem() {
    if(num_devices == 0) {
      checkError(-1, "No devices");
    }

    input_a.reset(num_devices);
    input_b.reset(num_devices);
    output.reset(num_devices);
    ref_output.reset(num_devices);

    // Generate input vectors A and B and the reference output consisting
    // of a total of N elements.
    // We create separate arrays for each device so that each device has an
    // aligned buffer.
    for(unsigned i = 0; i < num_devices; ++i) {
      input_a[i].reset(n_per_device[i]);
      input_b[i].reset(n_per_device[i]);
      output[i].reset(n_per_device[i]);
      ref_output[i].reset(n_per_device[i]);

      for(unsigned j = 0; j < n_per_device[i]; ++j) {
        input_a[i][j] = rand_float();
        input_b[i][j] = rand_float();
        ref_output[i][j] = input_a[i][j] + input_b[i][j];
      }
    }
  }

  void run() {
    cl_int status;

    const double start_time = getCurrentTimestamp();

    // Launch the problem for each device.
    scoped_array<cl_event> kernel_event(num_devices);
    scoped_array<cl_event> finish_event(num_devices);

    for(unsigned i = 0; i < num_devices; ++i) {

      // Transfer inputs to each device. Each of the host buffers supplied to
      // clEnqueueWriteBuffer here is already aligned to ensure that DMA is used
      // for the host-to-device transfer.
      cl_event write_event[2];
      status = clEnqueueWriteBuffer(queue[i], input_a_buf[i], CL_FALSE,
          0, n_per_device[i] * sizeof(float), input_a[i], 0, NULL, &write_event[0]);
      checkError(status, "Failed to transfer input A");

      status = clEnqueueWriteBuffer(queue[i], input_b_buf[i], CL_FALSE,
          0, n_per_device[i] * sizeof(float), input_b[i], 0, NULL, &write_event[1]);
      checkError(status, "Failed to transfer input B");

      // Set kernel arguments.
      unsigned argi = 0;

      status = clSetKernelArg(kernel[i], argi++, sizeof(cl_mem), &input_a_buf[i]);
      checkError(status, "Failed to set argument %d", argi - 1);

      status = clSetKernelArg(kernel[i], argi++, sizeof(cl_mem), &input_b_buf[i]);
      checkError(status, "Failed to set argument %d", argi - 1);

      status = clSetKernelArg(kernel[i], argi++, sizeof(cl_mem), &output_buf[i]);
      checkError(status, "Failed to set argument %d", argi - 1);

      // Enqueue kernel.
      // Use a global work size corresponding to the number of elements to add
      // for this device.
      //
      // We don't specify a local work size and let the runtime choose
      // (it'll choose to use one work-group with the same size as the global
      // work-size).
      //
      // Events are used to ensure that the kernel is not launched until
      // the writes to the input buffers have completed.
      const size_t global_work_size = n_per_device[i];
      printf("Launching for device %d (%zu elements)\n", i, global_work_size);

      status = clEnqueueNDRangeKernel(queue[i], kernel[i], 1, NULL,
          &global_work_size, NULL, 2, write_event, &kernel_event[i]);
      checkError(status, "Failed to launch kernel");

      // Read the result. This is the final operation.
      status = clEnqueueReadBuffer(queue[i], output_buf[i], CL_FALSE,
          0, n_per_device[i] * sizeof(float), output[i], 1, &kernel_event[i], &finish_event[i]);

      // Release local events.
      clReleaseEvent(write_event[0]);
      clReleaseEvent(write_event[1]);
    }

    // Wait for all devices to finish.
    clWaitForEvents(num_devices, finish_event);

    const double end_time = getCurrentTimestamp();

    // Wall-clock time taken.
    printf("\nTime: %0.3f ms\n", (end_time - start_time) * 1e3);

    // Get kernel times using the OpenCL event profiling API.
    for(unsigned i = 0; i < num_devices; ++i) {
      cl_ulong time_ns = getStartEndTime(kernel_event[i]);
      printf("Kernel time (device %d): %0.3f ms\n", i, double(time_ns) * 1e-6);
    }

    // Release all events.
    for(unsigned i = 0; i < num_devices; ++i) {
      clReleaseEvent(kernel_event[i]);
      clReleaseEvent(finish_event[i]);
    }

    // Verify results.
    bool pass = true;
    for(unsigned i = 0; i < num_devices && pass; ++i) {
      for(unsigned j = 0; j < n_per_device[i] && pass; ++j) {
        if(fabsf(output[i][j] - ref_output[i][j]) > 1.0e-5f) {
          printf("Failed verification @ device %d, index %d\nOutput: %f\nReference: %f\n",
              i, j, output[i][j], ref_output[i][j]);
          pass = false;
        }
      }
    }

    printf("\nVerification: %s\n", pass ? "PASS" : "FAIL");
  }

  // Free the resources allocated during initialization
  void cleanup() {
    for(unsigned i = 0; i < num_devices; ++i) {
      if(kernel && kernel[i]) {
        clReleaseKernel(kernel[i]);
      }
      if(queue && queue[i]) {
        clReleaseCommandQueue(queue[i]);
      }
      if(input_a_buf && input_a_buf[i]) {
        clReleaseMemObject(input_a_buf[i]);
      }
      if(input_b_buf && input_b_buf[i]) {
        clReleaseMemObject(input_b_buf[i]);
      }
      if(output_buf && output_buf[i]) {
        clReleaseMemObject(output_buf[i]);
      }
    }

    if(program) {
      clReleaseProgram(program);
    }
    if(context) {
      clReleaseContext(context);
    }
  }
Adding two 1,000,000-element vectors took 107.127 ms end to end. Try doing the same computation on the ARM alone, see how long it takes, and compare the performance.

That's it for today. Good night, everyone!
