特征向量搜索 - cublas实现
1.背景
1.1.场景
在机器视觉领域,经常遇到如下的业务需求:
- 两张人脸照片是否是属于同一个人
- 小区出入口监控视频分析进出人员是否属于小区业主
- 车站监控分析是否有黄牛、惯偷出没
- 聚会大量合照,按出现人物进行聚类归档
上述的场景都用到了机器视觉中的人脸搜索的功能,而通常人脸搜索需要经过人脸检测、人脸特征提取以及特征库搜索三个步骤
1.2.特征向量搜索
人脸库可以序列化成N条固定维度(dimension)、固定精度(precesion)的向量,而人脸库搜索搜索问题就能归约成K最近邻搜索(K-NN问题)。关于最近邻检索问题可以参考这篇文章 https://blog.csdn.net/lovego123/article/details/67638789
本文中的特征向量搜索并不允许精度损失,故不采取近似最近邻检索算法。在N条特征值的人脸库中搜索M条特征的TOP P个最相似特征,采用余弦距离来计算向量相似度
人脸特征提取的向量已经经过了归一化,故对于k维的人脸特征向量,特征库搜索问题可以看出n * k的矩阵点乘 k * m的矩阵
2.cublas实现矩阵相乘
2.1.cublas使用注意点
cublas实现矩阵相乘的基本原理和代码可以参考这篇博文 https://www.cnblogs.com/cuancuancuanhao/p/7763256.html
单精度(float32)矩阵相乘用的是cublasSgemm函数,官方文档链接
需要特别注意的是
- 矩阵在内存或者显存中均是一维存储的,这就涉及了行优先存储/列优先存储
- cuda编程中,显存默认是列优先存储,这就意味着对于矩阵A,我们会以行优先存储在内存中,而拷贝到显存时,会自动按列优先存储,即得到的是A的转置
- C矩阵是列优先存储的,为了简化转置的过程,可以直接计算C的转置,即A矩阵的转置点乘B矩阵,对cublasSgemm函数两矩阵分别为CUBLAS_OP_T 和 CUBLAS_OP_N
2.2.样例代码
int matrixMultiplyCublas(int number, int batch, int dimension, int round, float sparse, bool verbose, float *duration)
{
float *vector_a = new float[number];
float *vector_b = new float[batch];
cout.setf(ios::right);
cout.precision(6);
if (verbose)
{
int dense = 0;
for (int i = 0; i < dimension; i++)
{
if (atom[i] > 0)
{
dense++;
}
}
cout << "Sparse: " << (float)dense / (float)dimension;
}
float *d_A, *d_B, *d_C;
unsigned int size_C = number * batch;
unsigned int mem_size_C = sizeof(float) * size_C;
float *h_CUBLAS = (float *)malloc(mem_size_C);
cudaMalloc((void **)&d_A, sizeof(float) * number * dimension);
cudaMalloc((void **)&d_B, sizeof(float) * batch * dimension);
if (verbose)
{
cout << "Matrix A:" << endl;
}
int remain = number;
do
{
int patch = PATCH;
if (remain < patch)
{
patch = remain;
}
if (verbose)
{
for (int i = 0; i < patch; i++)
{
for (int j = 0; j < dimension; j++)
{
cout << atom[i * dimension + j] << " ";
}
cout << endl;
}
}
cudaMemcpy(d_A + (number - remain), atom, sizeof(float) * patch * dimension, cudaMemcpyHostToDevice);
remain = remain - patch;
} while (remain > 0);
remain = batch;
if (verbose)
{
cout << "Matrix B:" << endl;
}
do
{
int patch = PATCH;
if (remain < patch)
{
patch = remain;
}
if (verbose)
{
for (int i = 0; i < patch; i++)
{
for (int j = 0; j < dimension; j++)
{
cout << atom[i * dimension + j] << " ";
}
cout << endl;
}
}
cudaMemcpy(d_B + (batch - remain), atom, sizeof(float) * patch * dimension, cudaMemcpyHostToDevice);
remain = remain - patch;
} while (remain > 0);
cudaMalloc((void **)&d_C, mem_size_C);
cout << "Init matrix A and B" << endl;
dim3 threads(1, 1);
dim3 grid(1, 1);
//cuBLAS代码
const float alpha = 1.0f;
const float beta = 0.0f;
int m = number, n = batch, k = dimension;
clock_t start;
start = clock();
cublasHandle_t handle;
cublasCreate(&handle);
for (int i = 0; i < round; i++)
{
cublasStatus_t ret = cublasSgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N, m, n, k, &alpha, d_A, k, d_B, k, &beta, d_C, m);
if (ret != CUBLAS_STATUS_SUCCESS)
{
cout << "cublasSgemm error:" << ret << endl;
}
cudaMemcpy(h_CUBLAS, d_C, mem_size_C, cudaMemcpyDeviceToHost);
}
cublasDestroy(handle);
cout << "Matrix mulitiply done" << endl;
duration[0] = (float)(clock() - start) / CLOCKS_PER_SEC;
if (verbose)
{
cout << "Result C:" << endl;
for (int i = 0; i < batch * number; i++)
{
cout << h_CUBLAS[i] << " ";
if ((i + 1) % number == 0)
cout << endl;
}
}
free(h_CUBLAS);
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
}