Reference
cuSOLVER API Reference, Sections 2.4.5.1 through 2.4.5.3
Main text
2.4.5 Dense Eigenvalue Solver Reference (64-bit API)
2.4.5.1 cusolverDnXgesvd()
The helper function below calculates the size of the workspace buffers that must be pre-allocated.
cusolverStatus_t
cusolverDnXgesvd_bufferSize(
    cusolverDnHandle_t handle,
    cusolverDnParams_t params,
    signed char jobu,
    signed char jobvt,
    int64_t m,
    int64_t n,
    cudaDataType dataTypeA,
    const void *A,
    int64_t lda,
    cudaDataType dataTypeS,
    const void *S,
    cudaDataType dataTypeU,
    const void *U,
    int64_t ldu,
    cudaDataType dataTypeVT,
    const void *VT,
    int64_t ldvt,
    cudaDataType computeType,
    size_t *workspaceInBytesOnDevice,
    size_t *workspaceInBytesOnHost)
The routine below:
cusolverStatus_t
cusolverDnXgesvd(
    cusolverDnHandle_t handle,
    cusolverDnParams_t params,
    signed char jobu,
    signed char jobvt,
    int64_t m,
    int64_t n,
    cudaDataType dataTypeA,
    void *A,
    int64_t lda,
    cudaDataType dataTypeS,
    void *S,
    cudaDataType dataTypeU,
    void *U,
    int64_t ldu,
    cudaDataType dataTypeVT,
    void *VT,
    int64_t ldvt,
    cudaDataType computeType,
    void *bufferOnDevice,
    size_t workspaceInBytesOnDevice,
    void *bufferOnHost,
    size_t workspaceInBytesOnHost,
    int *info)
This function computes the singular value decomposition (SVD) of an m×n matrix A, together with the corresponding left and/or right singular vectors. The SVD is written as

A = U * Σ * V^H

where Σ is an m×n matrix that is zero except for its min(m,n) diagonal elements, U is an m×m unitary matrix, and V is an n×n unitary matrix. The diagonal elements of Σ are the singular values of A; they are real and non-negative, and are returned in descending order. The first min(m,n) columns of U and V are the left and right singular vectors of A.
The user must provide the device and host workspaces pointed to by the input parameters bufferOnDevice and bufferOnHost. The input parameters workspaceInBytesOnDevice (and workspaceInBytesOnHost) are the sizes in bytes of the device (and host) workspaces, as returned by cusolverDnXgesvd_bufferSize().
If the output parameter info = -i (less than zero), the i-th parameter is wrong (not counting the handle). If bdsqr did not converge, info specifies how many superdiagonals of an intermediate bidiagonal form did not converge to zero.
Currently, cusolverDnXgesvd supports only the default algorithm.

Table of algorithms supported by cusolverDnXgesvd

CUSOLVER_ALG_0 or NULL | Default algorithm. |
---|---|
Please visit cuSOLVER Library Samples - Xgesvd for a code example.
Remark 1: gesvd only supports m >= n.

Remark 2: The routine returns V^H, not V.
List of input parameters for cusolverDnXgesvd_bufferSize and cusolverDnXgesvd:
API of cusolverDnXgesvd
Parameter | Memory | In/out | Meaning |
---|---|---|---|
handle | host | input | Handle to the cuSolverDN library context. |
params | host | input | Structure with the information collected by cusolverDnSetAdvOptions. |
jobu | host | input | Specifies options for computing all or part of the matrix U: = 'A': all m columns of U are returned in the array U; = 'S': the first min(m,n) columns of U (the left singular vectors) are returned in the array U; = 'O': the first min(m,n) columns of U (the left singular vectors) are overwritten on the array A; = 'N': no columns of U (no left singular vectors) are computed. |
jobvt | host | input | Specifies options for computing all or part of the matrix V^T: = 'A': all n rows of V^T are returned in the array VT; = 'S': the first min(m,n) rows of V^T (the right singular vectors) are returned in the array VT; = 'O': the first min(m,n) rows of V^T (the right singular vectors) are overwritten on the array A; = 'N': no rows of V^T (no right singular vectors) are computed. |
m | host | input | Number of rows of the matrix A. |
n | host | input | Number of columns of the matrix A. |
dataTypeA | host | input | Data type of the matrix A. |
A | device | in/out | Array of dimension lda * n with lda not less than max(1,m). On exit, the contents of A are destroyed. |
lda | host | input | Leading dimension of the two-dimensional array used to store the matrix A. |
dataTypeS | host | input | Data type of the array S. |
S | device | output | Real array of dimension min(m,n). The singular values of A, sorted in descending order. |
dataTypeU | host | input | Data type of the matrix U. |
U | device | output | Array of dimension ldu * m with ldu not less than max(1,m). |
ldu | host | input | Leading dimension of the two-dimensional array used to store the matrix U. |
dataTypeVT | host | input | Data type of the matrix VT. |
VT | device | output | Array of dimension ldvt * n with ldvt not less than max(1,n). |
ldvt | host | input | Leading dimension of the two-dimensional array used to store the matrix VT. |
computeType | host | input | Data type of the computation. |
bufferOnDevice | device | in/out | Device workspace. Array of type void of size workspaceInBytesOnDevice bytes. |
workspaceInBytesOnDevice | host | input | Size in bytes of bufferOnDevice, returned by cusolverDnXgesvd_bufferSize. |
bufferOnHost | host | in/out | Host workspace. Array of type void of size workspaceInBytesOnHost bytes. |
workspaceInBytesOnHost | host | input | Size in bytes of bufferOnHost, returned by cusolverDnXgesvd_bufferSize. |
info | device | output | If info = 0, the operation completed successfully. If info = -i, the i-th parameter is wrong (not counting the handle). If info > 0, info indicates how many superdiagonals of an intermediate bidiagonal form did not converge to zero. |
The generic API has several type parameters: dataTypeA is the data type of the matrix A, dataTypeS is the data type of the array S, dataTypeU is the data type of the matrix U, dataTypeVT is the data type of the matrix VT, and computeType is the compute type of the operation. cusolverDnXgesvd only supports the following four combinations.
Valid combinations of data types and compute type
DataTypeA | DataTypeS | DataTypeU | DataTypeVT | ComputeType | Meaning |
---|---|---|---|---|---|
CUDA_R_32F | CUDA_R_32F | CUDA_R_32F | CUDA_R_32F | CUDA_R_32F | SGESVD |
CUDA_R_64F | CUDA_R_64F | CUDA_R_64F | CUDA_R_64F | CUDA_R_64F | DGESVD |
CUDA_C_32F | CUDA_R_32F | CUDA_C_32F | CUDA_C_32F | CUDA_C_32F | CGESVD |
CUDA_C_64F | CUDA_R_64F | CUDA_C_64F | CUDA_C_64F | CUDA_C_64F | ZGESVD |
Status Returned
CUSOLVER_STATUS_SUCCESS | The operation completed successfully. |
---|---|
CUSOLVER_STATUS_NOT_INITIALIZED | The library was not initialized. |
CUSOLVER_STATUS_INVALID_VALUE | Invalid parameters were passed (m,n < 0, or lda < max(1,m), or ldu < max(1,m), or ldvt < max(1,n)). |
CUSOLVER_STATUS_INTERNAL_ERROR | An internal operation failed. |
2.4.5.2 cusolverDnXgesvdp()
Most of this routine is the same as cusolverDnXgesvd, apart from a few parameters and the optimization method, so only the optimization method is described here.
The SVD is written as

A = U * Σ * V^H

where Σ is an m×n matrix that is zero except for its min(m,n) diagonal elements, U is an m×m unitary matrix, and V is an n×n unitary matrix. The diagonal elements of Σ are the singular values of A; they are real and non-negative, and are returned in descending order. The first min(m,n) columns of U and V are the left and right singular vectors of A.
cusolverDnXgesvdp combines the polar decomposition of [14] with cusolverDnXsyevd to compute the SVD. It is much faster than cusolverDnXgesvd, which is based on the QR algorithm. However, the polar decomposition of [14] may not deliver a full unitary matrix when the matrix has singular values close to zero. To work around the issue when singular values are close to zero, a small perturbation is added so that the polar decomposition can deliver the correct result. The consequence is that the singular values are shifted by this perturbation and become inaccurate. The output parameter h_err_sigma is the magnitude of this perturbation; in other words, h_err_sigma indicates the accuracy of the SVD.
The user must provide the device and host workspaces pointed to by the input parameters bufferOnDevice and bufferOnHost. The input parameters workspaceInBytesOnDevice (and workspaceInBytesOnHost) are the sizes in bytes of the device (and host) workspaces, as returned by cusolverDnXgesvdp_bufferSize().
If the output parameter info = -i (less than zero), the i-th parameter is wrong (not counting the handle).
2.4.5.3 cusolverDnXgesvdr()
This function computes the approximate rank-k singular value decomposition (k-SVD) of an m×n matrix A, together with the corresponding left and/or right singular vectors. The k-SVD is written as

A ≈ U * Σ * V^H

where Σ is a k×k matrix that is zero except for its diagonal elements, U is an m×k orthonormal matrix, and V is a k×n orthonormal matrix. The diagonal elements of Σ are the approximate singular values of A; they are real and non-negative, and are returned in descending order. The columns of U and V are the top-k left and right singular vectors of A.
cusolverDnXgesvdr implements the randomized method described in [15] to compute the k-SVD, which is accurate with high probability when the conditions described in [15] hold. cusolverDnXgesvdr is intended to compute a very small portion of the spectrum (meaning that k is very small compared to min(m,n)). It is fast and delivers good quality, especially when the dimensions of the matrix are large.
The accuracy of the method depends on the spectrum of A, the number of power iterations, the oversampling parameter p, and the ratio between p and the dimensions of the matrix A. Larger values of the oversampling p or a larger number of iterations may produce a more accurate approximation, but will also increase the runtime of cusolverDnXgesvdr.
Our recommendation is to use two iterations and set the oversampling to at least 2k. Once the solver provides enough accuracy, adjust the values of k and niters for better performance.
The user must provide the device and host workspaces pointed to by the input parameters bufferOnDevice and bufferOnHost. The input parameters workspaceInBytesOnDevice (and workspaceInBytesOnHost) are the sizes in bytes of the device (and host) workspaces, as returned by cusolverDnXgesvdr_bufferSize().
If the output parameter info = -i (less than zero), the i-th parameter is wrong (not counting the handle).
Example code
/*
* Copyright 2020 NVIDIA Corporation. All rights reserved.
*
* NOTICE TO LICENSEE:
*
* This source code and/or documentation ("Licensed Deliverables") are
* subject to NVIDIA intellectual property rights under U.S. and
* international Copyright laws.
*
* These Licensed Deliverables contained herein is PROPRIETARY and
* CONFIDENTIAL to NVIDIA and is being provided under the terms and
* conditions of a form of NVIDIA software license agreement by and
* between NVIDIA and Licensee ("License Agreement") or electronically
* accepted by Licensee. Notwithstanding any terms or conditions to
* the contrary in the License Agreement, reproduction or disclosure
* of the Licensed Deliverables to any third party without the express
* written consent of NVIDIA is prohibited.
*
* NOTWITHSTANDING ANY TERMS OR CONDITIONS TO THE CONTRARY IN THE
* LICENSE AGREEMENT, NVIDIA MAKES NO REPRESENTATION ABOUT THE
* SUITABILITY OF THESE LICENSED DELIVERABLES FOR ANY PURPOSE. IT IS
* PROVIDED "AS IS" WITHOUT EXPRESS OR IMPLIED WARRANTY OF ANY KIND.
* NVIDIA DISCLAIMS ALL WARRANTIES WITH REGARD TO THESE LICENSED
* DELIVERABLES, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY,
* NONINFRINGEMENT, AND FITNESS FOR A PARTICULAR PURPOSE.
* NOTWITHSTANDING ANY TERMS OR CONDITIONS TO THE CONTRARY IN THE
* LICENSE AGREEMENT, IN NO EVENT SHALL NVIDIA BE LIABLE FOR ANY
* SPECIAL, INDIRECT, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, OR ANY
* DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS,
* WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS
* ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE
* OF THESE LICENSED DELIVERABLES.
*
* U.S. Government End Users. These Licensed Deliverables are a
* "commercial item" as that term is defined at 48 C.F.R. 2.101 (OCT
* 1995), consisting of "commercial computer software" and "commercial
* computer software documentation" as such terms are used in 48
* C.F.R. 12.212 (SEPT 1995) and is provided to the U.S. Government
* only as a commercial end item. Consistent with 48 C.F.R.12.212 and
* 48 C.F.R. 227.7202-1 through 227.7202-4 (JUNE 1995), all
* U.S. Government End Users acquire the Licensed Deliverables with
* only those rights set forth herein.
*
* Any use of the Licensed Deliverables in individual and commercial
* software must include, in the user documentation and internal
* comments to the code, the above Disclaimer and U.S. Government End
* Users Notice.
*/
#include <cstdio>
#include <cstdlib>
#include <new> // std::bad_alloc
#include <vector>
#include <cuda_runtime.h>
#include <cusolverDn.h>
#include "cusolver_utils.h"
int main(int argc, char *argv[]) {
cusolverDnHandle_t cusolverH = NULL;
cublasHandle_t cublasH = NULL;
cudaStream_t stream = NULL;
using data_type = double;
const int64_t m = 3;
const int64_t n = 2;
const int64_t lda = m;
/* | 1 2 |
* A = | 4 5 |
* | 2 1 |
*/
const std::vector<data_type> A = {1.0, 4.0, 2.0, 2.0, 5.0, 1.0};
std::vector<data_type> U(lda * m, 0);
std::vector<data_type> VT(lda * n, 0);
std::vector<data_type> S(n, 0);
// Exact singular values of A (square roots of the eigenvalues of A^T * A),
// used below to measure the error of the computed S.
const std::vector<data_type> S_exact = {7.065283497082729, 1.040081297712078};
data_type *d_A = nullptr;
data_type *d_S = nullptr;
data_type *d_U = nullptr;
data_type *d_VT = nullptr;
int *d_info = nullptr;
data_type *d_work = nullptr;
data_type *h_work = nullptr;
data_type *d_rwork = nullptr;
data_type *d_W = nullptr; // W = S*VT
size_t workspaceInBytesOnDevice = 0;
size_t workspaceInBytesOnHost = 0;
int info = 0;
const data_type h_one = 1;
const data_type h_minus_one = -1;
std::printf("A = (matlab base-1)\n");
print_matrix(m, n, A.data(), lda);
std::printf("=====\n");
/* step 1: create cusolver handle, bind a stream */
CUSOLVER_CHECK(cusolverDnCreate(&cusolverH));
CUBLAS_CHECK(cublasCreate(&cublasH));
CUDA_CHECK(cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking));
CUSOLVER_CHECK(cusolverDnSetStream(cusolverH, stream));
CUBLAS_CHECK(cublasSetStream(cublasH, stream));
/* step 2: copy A to device */
CUDA_CHECK(cudaMalloc(reinterpret_cast<void **>(&d_A), sizeof(data_type) * A.size()));
CUDA_CHECK(cudaMalloc(reinterpret_cast<void **>(&d_S), sizeof(data_type) * S.size()));
CUDA_CHECK(cudaMalloc(reinterpret_cast<void **>(&d_U), sizeof(data_type) * U.size()));
CUDA_CHECK(cudaMalloc(reinterpret_cast<void **>(&d_VT), sizeof(data_type) * VT.size()));
CUDA_CHECK(cudaMalloc(reinterpret_cast<void **>(&d_info), sizeof(int)));
CUDA_CHECK(cudaMalloc(reinterpret_cast<void **>(&d_W), sizeof(data_type) * lda * n));
CUDA_CHECK(cudaMemcpyAsync(d_A, A.data(), sizeof(data_type) * lda * n, cudaMemcpyHostToDevice,
stream));
signed char jobu = 'A'; // all m columns of U
signed char jobvt = 'A'; // all n columns of VT
/* step 3: query working space of SVD */
CUSOLVER_CHECK(cusolverDnXgesvd_bufferSize(
cusolverH,
nullptr,
jobu,
jobvt,
m,
n,
traits<data_type>::cuda_data_type,
d_A,
lda,
traits<data_type>::cuda_data_type,
d_S,
traits<data_type>::cuda_data_type,
d_U,
lda,
traits<data_type>::cuda_data_type,
d_VT,
lda,
traits<data_type>::cuda_data_type,
&workspaceInBytesOnDevice,
&workspaceInBytesOnHost));
CUDA_CHECK(cudaMalloc(reinterpret_cast<void **>(&d_work), workspaceInBytesOnDevice));
h_work = (data_type*)malloc(workspaceInBytesOnHost);
if (!h_work)
{
throw std::bad_alloc();
}
/* step 4: compute SVD */
CUSOLVER_CHECK(cusolverDnXgesvd(
cusolverH,
nullptr,
jobu,
jobvt,
m,
n,
traits<data_type>::cuda_data_type,
d_A,
lda,
traits<data_type>::cuda_data_type,
d_S,
traits<data_type>::cuda_data_type,
d_U,
lda,
traits<data_type>::cuda_data_type,
d_VT,
lda,
traits<data_type>::cuda_data_type,
d_work,
workspaceInBytesOnDevice,
h_work,
workspaceInBytesOnHost,
d_info));
CUDA_CHECK(cudaMemcpyAsync(U.data(), d_U, sizeof(data_type) * U.size(), cudaMemcpyDeviceToHost,
stream));
CUDA_CHECK(cudaMemcpyAsync(VT.data(), d_VT, sizeof(data_type) * VT.size(),
cudaMemcpyDeviceToHost, stream));
CUDA_CHECK(cudaMemcpyAsync(S.data(), d_S, sizeof(data_type) * S.size(), cudaMemcpyDeviceToHost,
stream));
CUDA_CHECK(cudaMemcpyAsync(&info, d_info, sizeof(int), cudaMemcpyDeviceToHost, stream));
CUDA_CHECK(cudaStreamSynchronize(stream));
std::printf("after Xgesvd: info = %d\n", info);
if (0 > info) {
std::printf("%d-th parameter is wrong \n", -info);
exit(1);
}
std::printf("=====\n");
std::printf("S = (matlab base-1)\n");
print_matrix(n, 1, S.data(), lda);
std::printf("=====\n");
std::printf("U = (matlab base-1)\n");
print_matrix(m, m, U.data(), lda);
std::printf("=====\n");
std::printf("VT = (matlab base-1)\n");
print_matrix(n, n, VT.data(), lda);
std::printf("=====\n");
// step 5: measure error of singular value
double ds_sup = 0;
for (int j = 0; j < n; j++) {
double err = fabs(S[j] - S_exact[j]);
ds_sup = (ds_sup > err) ? ds_sup : err;
}
std::printf("|S - S_exact| = %E \n", ds_sup);
// step 6: |A - U*S*VT|
// W = S*VT
CUBLAS_CHECK(cublasDdgmm(cublasH, CUBLAS_SIDE_LEFT, n, n, d_VT, lda, d_S, 1, d_W, lda));
CUDA_CHECK(cudaMemcpyAsync(d_A, A.data(), sizeof(data_type) * A.size(), cudaMemcpyHostToDevice,
stream));
CUBLAS_CHECK(cublasDgemm_v2(cublasH,
CUBLAS_OP_N, // U
CUBLAS_OP_N, // W
m, // number of rows of A
n, // number of columns of A
n, // number of columns of U
&h_minus_one, /* host pointer */
d_U, // U
lda,
d_W, // W
lda, &h_one, /* hostpointer */
d_A, lda));
double dR_fro = 0.0;
CUBLAS_CHECK(cublasDnrm2_v2(cublasH, A.size(), d_A, 1, &dR_fro));
CUDA_CHECK(cudaStreamSynchronize(stream));
std::printf("|A - U*S*VT| = %E \n", dR_fro);
/* free resources */
CUDA_CHECK(cudaFree(d_A));
CUDA_CHECK(cudaFree(d_S));
CUDA_CHECK(cudaFree(d_U));
CUDA_CHECK(cudaFree(d_VT));
CUDA_CHECK(cudaFree(d_info));
CUDA_CHECK(cudaFree(d_work));
CUDA_CHECK(cudaFree(d_rwork));
CUDA_CHECK(cudaFree(d_W));
free(h_work);
CUSOLVER_CHECK(cusolverDnDestroy(cusolverH));
CUBLAS_CHECK(cublasDestroy(cublasH));
CUDA_CHECK(cudaStreamDestroy(stream));
CUDA_CHECK(cudaDeviceReset());
return EXIT_SUCCESS;
}