特征向量搜索 - cublas实现

1.背景

1.1.场景

在机器视觉领域,经常遇到如下的业务需求:

上述的场景都用到了机器视觉中的人脸搜索的功能,而通常人脸搜索需要经过人脸检测、人脸特征提取以及特征库搜索三个步骤

1.2.特征向量搜索

人脸库可以序列化成N条固定维度(dimension)、固定精度(precesion)的向量,而人脸库搜索搜索问题就能归约成K最近邻搜索(K-NN问题)。关于最近邻检索问题可以参考这篇文章 https://blog.csdn.net/lovego123/article/details/67638789

本文中的特征向量搜索并不允许精度损失,故不采取近似最近邻检索算法。在N条特征值的人脸库中搜索M条特征的TOP P个最相似特征,采用余弦距离来计算向量相似度

余弦距离公式

人脸特征提取的向量已经经过了归一化,故对于k维的人脸特征向量,特征库搜索问题可以看出n * k的矩阵点乘 k * m的矩阵

2.cublas实现矩阵相乘

2.1.cublas使用注意点

cublas实现矩阵相乘的基本原理和代码可以参考这篇博文 https://www.cnblogs.com/cuancuancuanhao/p/7763256.html

单精度(float32)矩阵相乘用的是cublasSgemm函数,官方文档链接

sgemm

需要特别注意的是

  1. 矩阵在内存或者显存中均是一维存储的,这就涉及了行优先存储/列优先存储
  2. cuda编程中,显存默认是列优先存储,这就意味着对于矩阵A,我们会以行优先存储在内存中,而拷贝到显存时,会自动按列优先存储,即得到的是A的转置
  3. C矩阵是列优先存储的,为了简化转置的过程,可以直接计算C的转置,即A矩阵的转置点乘B矩阵,对cublasSgemm函数两矩阵分别为CUBLAS_OP_T 和 CUBLAS_OP_N

2.2.样例代码

int matrixMultiplyCublas(int number, int batch, int dimension, int round, float sparse, bool verbose, float *duration)
{

    float *vector_a = new float[number];
    float *vector_b = new float[batch];
    cout.setf(ios::right);
    cout.precision(6);

    if (verbose)
    {
        int dense = 0;
        for (int i = 0; i < dimension; i++)
        {
            if (atom[i] > 0)
            {
                dense++;
            }
        }
        cout << "Sparse: " << (float)dense / (float)dimension;
    }


    float *d_A, *d_B, *d_C;
    unsigned int size_C = number * batch;
    unsigned int mem_size_C = sizeof(float) * size_C;
    float *h_CUBLAS = (float *)malloc(mem_size_C);
    cudaMalloc((void **)&d_A, sizeof(float) * number * dimension);
    cudaMalloc((void **)&d_B, sizeof(float) * batch * dimension);
    if (verbose)
    {
        cout << "Matrix A:" << endl;
    }
    int remain = number;
    do
    {
        int patch = PATCH;
        if (remain < patch)
        {
            patch = remain;
        }
        if (verbose)
        {
            for (int i = 0; i < patch; i++)
            {
                for (int j = 0; j < dimension; j++)
                {
                    cout << atom[i * dimension + j] << " ";
                }
                cout << endl;
            }
        }
        cudaMemcpy(d_A + (number - remain), atom, sizeof(float) * patch * dimension, cudaMemcpyHostToDevice);
        remain = remain - patch;
    } while (remain > 0);
    remain = batch;
    if (verbose)
    {
        cout << "Matrix B:" << endl;
    }
    do
    {
        int patch = PATCH;
        if (remain < patch)
        {
            patch = remain;
        }
        if (verbose)
        {
            for (int i = 0; i < patch; i++)
            {
                for (int j = 0; j < dimension; j++)
                {
                    cout << atom[i * dimension + j] << " ";
                }
                cout << endl;
            }
        }
        cudaMemcpy(d_B + (batch - remain), atom, sizeof(float) * patch * dimension, cudaMemcpyHostToDevice);
        remain = remain - patch;
    } while (remain > 0);
    cudaMalloc((void **)&d_C, mem_size_C);

    cout << "Init matrix A and B" << endl;
    dim3 threads(1, 1);
    dim3 grid(1, 1);

    //cuBLAS代码
    const float alpha = 1.0f;
    const float beta = 0.0f;
    int m = number, n = batch, k = dimension;
    clock_t start;
    start = clock();

    cublasHandle_t handle;
    cublasCreate(&handle);
    for (int i = 0; i < round; i++)
    {
        cublasStatus_t ret = cublasSgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N, m, n, k, &alpha, d_A, k, d_B, k, &beta, d_C, m);
        if (ret != CUBLAS_STATUS_SUCCESS)
        {
            cout << "cublasSgemm error:" << ret << endl;
        }

        cudaMemcpy(h_CUBLAS, d_C, mem_size_C, cudaMemcpyDeviceToHost);
    }
    cublasDestroy(handle);
    cout << "Matrix mulitiply done" << endl;
    duration[0] = (float)(clock() - start) / CLOCKS_PER_SEC;

    if (verbose)
    {
        cout << "Result C:" << endl;
        for (int i = 0; i < batch * number; i++)
        {
            cout << h_CUBLAS[i] << " ";
            if ((i + 1) % number == 0)
                cout << endl;
        }
    }
    free(h_CUBLAS);
	cudaFree(d_A);
	cudaFree(d_B);
	cudaFree(d_C);
}
Table of Contents