Testing oneAPI: Vector-add


Author: Parker2019 | Published 2020-07-08 16:59

    Introduction to oneAPI

    Intel's oneAPI aims to simplify development across CPUs, GPUs, FPGAs, AI accelerators, and other compute engines. The pitch to developers is one solution for multiple architectures. The Base Kit (Beta) was released around May/June 2020.

    Official introduction
    Official link
    Note that an Intel account may be required to download.

    Installation

    The download is an .exe with a graphical installer; just keep clicking Next. The final installed layout looks like this:

    Installation directory
    The installer is about 2 GB, and the finished installation occupies roughly 14 GB. Anyone familiar with Intel's acceleration libraries will recognize MKL, IPP, TBB, and so on. Credit where due: used correctly, Intel's acceleration libraries perform very well.

    A side note: the integrated graphics in Intel's new 11th-gen Core processors can already match entry-level discrete cards, and Intel's own FPGA products support OpenCL heterogeneous computing. So Intel is keen to launch a solution that lets developers write high-performance code without a deep grounding in low-level hardware languages such as OpenCL C or Verilog HDL.

    A new language arose to fill that role: DPC++ (Data Parallel C++). Intel designed DPC++ to be syntactically close to CUDA, so a programmer who knows CUDA well should have no trouble with it. At heart it is still C/C++, so a C/C++ background is enough to read the code.
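
    To make the CUDA comparison concrete, here is a minimal sketch of mine (not from the sample) showing the same element-wise add in both styles: the CUDA launch appears as comments, and the DPC++ equivalent follows. It mirrors the vector-add sample further down, so the names (queue, buffer, parallel_for) are the real DPC++ ones.

    #include <CL/sycl.hpp>
    using namespace cl::sycl;
    
    // CUDA (shown as comments for comparison only):
    //   __global__ void add(const int *a, const int *b, int *c) {
    //     int i = blockIdx.x * blockDim.x + threadIdx.x;
    //     c[i] = a[i] + b[i];
    //   }
    //   add<<<num_blocks, threads_per_block>>>(a, b, c);
    
    // DPC++: the queue plays the role of the stream, and parallel_for
    // plays the role of the <<<...>>> kernel launch.
    void add(queue &q, buffer<int, 1> &a_buf, buffer<int, 1> &b_buf,
             buffer<int, 1> &c_buf, size_t n) {
      q.submit([&](handler &h) {
        auto a = a_buf.get_access<access::mode::read>(h);
        auto b = b_buf.get_access<access::mode::read>(h);
        auto c = c_buf.get_access<access::mode::write>(h);
        h.parallel_for(range<1>{n}, [=](id<1> i) { c[i] = a[i] + b[i]; });
      });
    }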

    Testing

    Intel provides a programming guide for oneAPI, though there is no Chinese version yet:

    GUIDE
    Official guide
    You can also pick your Toolkit and language and search for the matching documents under Documentation in the left sidebar.

    As the programming guide explains, Intel has put all the sample code on GitHub:


    oneAPI-github-Sample code

    Github

    The sample used here is the vector add under the DPC++ compiler; the code follows:

    dpc_common.hpp

    //==============================================================
    // Copyright © 2020 Intel Corporation
    //
    // SPDX-License-Identifier: MIT
    // =============================================================
    
    #ifndef _DP_HPP
    #define _DP_HPP
    
    #pragma once
    
    #include <stdlib.h>
    #include <exception>
    #include <chrono>    // used by dpc::Timer below
    #include <iostream>  // used by the exception handler in debug builds
    
    #include <CL/sycl.hpp>
    
    namespace dpc {
    // This exception handler will catch async exceptions.
    static auto exception_handler = [](cl::sycl::exception_list eList) {
      for (std::exception_ptr const &e : eList) {
        try {
          std::rethrow_exception(e);
        } catch (std::exception const &e) {
    #if _DEBUG
          std::cout << "Failure" << std::endl;
    #endif
          std::terminate();
        }
      }
    };
    
    class queue : public cl::sycl::queue {
      // Enable profiling by default
      cl::sycl::property_list prop_list =
          cl::sycl::property_list{cl::sycl::property::queue::enable_profiling()};
    
     public:
      queue()
          : cl::sycl::queue(cl::sycl::default_selector{}, exception_handler, prop_list) {}
      queue(cl::sycl::device_selector &d)
          : cl::sycl::queue(d, exception_handler, prop_list) {}
      queue(cl::sycl::device_selector &d, cl::sycl::property_list &p)
          : cl::sycl::queue(d, exception_handler, p) {}
    };
    
    using Duration = std::chrono::duration<double>;
    
    class Timer {
     public:
      Timer() : start(std::chrono::steady_clock::now()) {}
    
      Duration elapsed() {
        auto now = std::chrono::steady_clock::now();
        return std::chrono::duration_cast<Duration>(now - start);
      }
    
     private:
      std::chrono::steady_clock::time_point start;
    };
    
    }  // namespace dpc
    
    #endif
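
    The header is self-contained. As a quick usage illustration (mine, not part of the sample), the dpc::queue and dpc::Timer helpers above could be used like this; the actual work submitted to the queue is elided:

    #include <iostream>
    #include "dpc_common.hpp"
    
    int main() {
      dpc::queue q;   // default device, async exception handler, profiling on
      dpc::Timer t;   // starts timing at construction
      // ... submit kernels to q here ...
      q.wait();       // wait for all submitted work to finish
      std::cout << "Elapsed: " << t.elapsed().count() << " s\n";
    }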
    

    vector-add-buffers.cpp

    //==============================================================
    // Vector Add is the equivalent of a Hello, World! sample for data parallel
    // programs. Building and running the sample verifies that your development
    // environment is set up correctly and demonstrates the use of the core features
    // of DPC++. This sample runs on both CPU and GPU (or FPGA). When run, it
    // computes on both the CPU and offload device, then compares results. If the
    // code executes on both CPU and offload device, the device name and a success
    // message are displayed. And, your development environment is set up correctly!
    //
    // For comprehensive instructions regarding DPC++ Programming, go to
    // https://software.intel.com/en-us/oneapi-programming-guide and search based on
    // relevant terms noted in the comments.
    //
    // DPC++ material used in the code sample:
    // •    A one dimensional array of data.
    // •    A device queue, buffer, accessor, and kernel.
    //==============================================================
    // Copyright © 2020 Intel Corporation
    //
    // SPDX-License-Identifier: MIT
    // =============================================================
    #include <CL/sycl.hpp>
    #include <array>
    #include <iostream>
    #include "dpc_common.hpp"
    #if FPGA || FPGA_EMULATOR
    #include <CL/sycl/intel/fpga_extensions.hpp>
    #endif
    
    using namespace sycl;
    
    // Array type and data size for this example.
    constexpr size_t array_size = 10000;
    typedef std::array<int, array_size> IntArray;
    
    //************************************
    // Vector add in DPC++ on device: returns sum in 4th parameter "sum_parallel".
    //************************************
    void VectorAdd(queue &q, const IntArray &a_array, const IntArray &b_array,
                   IntArray &sum_parallel) {
      // Create the range object for the arrays managed by the buffer.
      range<1> num_items{a_array.size()};
    
      // Create buffers that hold the data shared between the host and the devices.
      // The buffer destructor is responsible to copy the data back to host when it
      // goes out of scope.
      buffer a_buf(a_array);
      buffer b_buf(b_array);
      buffer sum_buf(sum_parallel.data(), num_items);
    
      // Submit a command group to the queue by a lambda function that contains the
      // data access permission and device computation (kernel).
      q.submit([&](handler &h) {
        // Create an accessor for each buffer with access permission: read, write or
        // read/write. The accessor is a mean to access the memory in the buffer.
        auto a = a_buf.get_access<access::mode::read>(h);
        auto b = b_buf.get_access<access::mode::read>(h);
    
        // The sum_accessor is used to store (with write permission) the sum data.
        auto sum = sum_buf.get_access<access::mode::write>(h);
    
        // Use parallel_for to run vector addition in parallel on device. This
        // executes the kernel.
        //    1st parameter is the number of work items.
        //    2nd parameter is the kernel, a lambda that specifies what to do per
        //    work item. The parameter of the lambda is the work item id.
        // DPC++ supports unnamed lambda kernel by default.
        h.parallel_for(num_items, [=](id<1> i) { sum[i] = a[i] + b[i]; });
      });
    }
    
    //************************************
    // Initialize the array from 0 to array_size - 1
    //************************************
    void InitializeArray(IntArray &a) {
      for (size_t i = 0; i < a.size(); i++) a[i] = i;
    }
    
    //************************************
    // Demonstrate vector add both in sequential on CPU and in parallel on device.
    //************************************
    int main() {
      // Create device selector for the device of your interest.
    #if FPGA_EMULATOR
      // DPC++ extension: FPGA emulator selector on systems without FPGA card.
      intel::fpga_emulator_selector d_selector;
    #elif FPGA
      // DPC++ extension: FPGA selector on systems with FPGA card.
      intel::fpga_selector d_selector;
    #else
      // The default device selector will select the most performant device.
      default_selector d_selector;
    #endif
    
      // Create array objects with "array_size" to store the input and output data.
      IntArray a, b, sum_sequential, sum_parallel;
    
      // Initialize input arrays with values from 0 to array_size - 1
      InitializeArray(a);
      InitializeArray(b);
    
      try {
        queue q(d_selector, dpc::exception_handler);
    
        // Print out the device information used for the kernel code.
        std::cout << "Running on device: "
                  << q.get_device().get_info<info::device::name>() << "\n";
        std::cout << "Vector size: " << a.size() << "\n";
    
        // Vector addition in DPC++
        VectorAdd(q, a, b, sum_parallel);
      } catch (exception const &e) {
        std::cout << "An exception is caught for vector add.\n";
        std::terminate();
      }
    
      // Compute the sum of two arrays in sequential for validation.
      for (size_t i = 0; i < sum_sequential.size(); i++)
        sum_sequential[i] = a[i] + b[i];
    
      // Verify that the two arrays are equal.
      for (size_t i = 0; i < sum_sequential.size(); i++) {
        if (sum_parallel[i] != sum_sequential[i]) {
          std::cout << "Vector add failed on device.\n";
          return -1;
        }
      }
    
      int indices[]{0, 1, 2, static_cast<int>(a.size() - 1)};
      constexpr size_t indices_size = sizeof(indices) / sizeof(int);
    
      // Print out the result of vector add.
      for (size_t i = 0; i < indices_size; i++) {
        int j = indices[i];
        if (i == indices_size - 1) std::cout << "...\n";
        std::cout << "[" << j << "]: " << a[j] << " + " << b[j] << " = "
                  << sum_parallel[j] << "\n";
      }
    
      std::cout << "Vector add successfully completed on device.\n";
      return 0;
    }
    

    With the code in hand, we can compile it and see the result. Intel ships a complete toolkit here, so compiler and debugger are all included; the catch is that none of these tools work without the right environment variables. For that, installing oneAPI adds a dedicated command prompt that loads the environment for you:


    cmd.png

    As shown above, opening this terminal loads the environment variables automatically.


    Environment variables loaded successfully
    After that, all we need to do is change into the source directory and compile by hand.
    Compilation

    A reference compile command:

    dpcpp -O2 -g -std=c++17 -o vector-add-buffers.exe src/vector-add-buffers.cpp
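
    The source also carries #if FPGA / FPGA_EMULATOR branches. To try the FPGA emulator path, a command along the following lines should work; the flags are my best recollection of the sample's README for the Beta, so treat this as a sketch rather than gospel:

    dpcpp -fintelfpga -DFPGA_EMULATOR -o vector-add-buffers.fpga_emu.exe src/vector-add-buffers.cpp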
    

    No errors means the compilation succeeded; it produces the files shown below:


    File structure

    Finally, run .\vector-add-buffers.exe to see the result:

    Result

    You can see it ran on the GPU, and adding two one-dimensional vectors of 10000 elements each finished almost instantly. How fast? Genuinely about as fast as printing Hello World, which is why the comment at the top of the source file calls adding two equal-length one-dimensional vectors the Hello World of data-parallel programs.
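
    If you want an actual number rather than an impression, SYCL's event profiling can time the kernel. A hypothetical sketch: it assumes VectorAdd is modified to return the event from q.submit(...), and that the queue was constructed with the enable_profiling property (the dpc::queue wrapper above turns this on; the plain queue in main() does not):

    // Hypothetical: requires VectorAdd to return the event from q.submit(...).
    event e = VectorAdd(q, a, b, sum_parallel);
    e.wait();
    auto t0 = e.get_profiling_info<info::event_profiling::command_start>();
    auto t1 = e.get_profiling_info<info::event_profiling::command_end>();
    std::cout << "Kernel time: " << (t1 - t0) / 1.0e6 << " ms\n";  // ns -> ms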

    Note: do not run this build in an ordinary terminal (cmd/PowerShell), because the environment variables will not be set there.
    For a single source file, compiling straight from the command line is fine, but for a project with many .cpp files and multiple target devices, consider an IDE such as Visual Studio, or use CMake to generate Makefiles (recommended, and common for cross-platform projects). In the GitHub sample code Intel also provides Visual Studio project files, so you can clone the whole repo and open it directly.
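
    If you do need a plain cmd/PowerShell window, you can load the same environment by calling the setvars script that ships with oneAPI; the path below assumes a default install location, so adjust it to where oneAPI actually landed on your machine:

    "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"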
