How GPU computing literally saved me at work. Python example
Today we touch on the most relevant topic – Python for working with the GPU. The author considers an example, trivial in its monstrosity, and demonstrates the solution, accompanying it with extensive listings. Enjoy reading!
None of us, in one form or another, have bypassed the recent hype around GPU computing. Before you read further, I’ll explain: I am not a GPU expert. My journey in the GPU world is just beginning. But today this technology has reached such power that, armed with it, you can solve a whole lot of problems. I was assigned a task at work, the machine spent hours on it, and no progress was visible. But, as soon as I took up the GPU – and the problem began to be solved in seconds. The task, which approximately took 2 days to complete, I was able to solve in just 20 seconds .
In the following sections, I will describe this task in detail. We will also discuss how and when to use the GPU to solve any such problems. So, we read carefully – believe me, you will not regret it. First, we will delve into the details of the task, then get comfortable with the GPU and, finally, use the GPU to solve this problem. I will use the Python Numba library and the Nvidia Volta V100 16GB GPU GPU .
1. Detailed description of the task
In the retail sector, you often have to look for similar or closest objects. I was given a list of positions, each of which was represented by k latent attributes. So, I was instructed to find the top 3 most similar positions to each of the positions in the list. The metric of similarity in this problem was chosen cosine similarity. This is what my data looked like.
List of data items with 64 latent characteristics
I was given a list in which there were about 10⁵ positions. Searching for the 3 most similar positions for each of them would require checking the cosine similarity with each and every element in the list. It would turn out n * k operations, where n is the number of positions, and k are the attributes for each position. It would be necessary to obtain the scalar product of this position from each of the remaining positions in the list.
def top_3_similar_items(X,mindex): # X это список позиций, нам требуется найти топ-3 позиций, наиболее похожих на позицию с индексом 'mindex' SMALL = -9999.0 # Просто для инициализации, поскольку нам нужно найти позицию с максимальным уровнем сходства(косинусный показатель) first_best_val = SMALL # индекс первой по схожести позиции будет храниться здесь first_best_index = -1 # косинусное сходство с первой по схожести позицией second_best_val = SMALL # индекс второй по схожести позиции будет храниться здесь second_best_index = -1 # косинусное сходство со второй по схожести позицией third_best_val = SMALL # индекс третьей по схожести позиции будет храниться здесь third_best_index = -1 # косинусное сходство с третьей по схожести позицией for j in range(n): # Всего n-позиций в списке if(mindex==j): continue tmp = np.dot(X[mindex],X[j])/(np.sqrt(np.sum(np.square(X[mindex]))) * np.sqrt(np.sum(np.square(X[j])))) # вычисление скалярного произведения if(tmp>=first_best_val): third_best_val = second_best_val third_best_index = second_best_index second_best_val = first_best_val second_best_index = first_best_index first_best_val = tmp first_best_index = j elif(tmp>=second_best_val): third_best_val = second_best_val third_best_index = second_best_index second_best_val = tmp second_best_index = j elif(tmp>third_best_val): third_best_val = tmp third_best_index = j return first_best_val,first_best_index,second_best_val,second_best_index,third_best_val,third_best_index # Возврат всех значений
Code for finding three positions that are as similar as possible to the
Now, when you find the top 3 similar for all positions in the list, the complexity is multiplied by another n. The final complexity is O (n * n * k) = O (n²k).
Code for finding the three most similar positions for each position in the
Test run and time evaluation list
I ran the code, trying to find the 3 most similar positions from a subset containing n = 10³ positions with k = 64. It took about 17 to complete this task with Python seconds at an average speed of 3.7 * 10⁶ operations per second. The code has been well optimized using Numpy operations and arrays. I note that all these operations are sequentially performed on the CPU.
Run for n = 10³ positions
Conclusion: the time it took for n = 10³ positions
Next, I increased the test subset to n = 10⁴ positions. Since the complexity is O (n²k), the execution time has increased 100 times (since n has increased 10 times). It took 1700 seconds to complete the code = 28.33 minutes.
Conclusion: the time it took to process n = 10⁴ positions
Next, we come to the most important one: estimating the time it takes to process a full list of 10⁵ positions. Counting, we see that the time complexity again increases by 100 times, since the time complexity of the algorithm is O (n²k).
Estimated time = 1700 * 100 seconds = 2834 minutes = 47.2 hours ~ 2 days.
Oh my God! So long!!!
You probably already realize that I actually managed to do everything very quickly, using the power of the GPU. In fact, the time gain when using a GPU is simply shocking. I’ll leave the numbers for a snack, but for now I suggest you get acquainted with the world of GPU.
2. Comparison of CPU and GPU
The Central Processing Unit (CPU), in essence, is the brain of any computing device: it executes the instructions written in the program, performing control, logical operations, as well as input / output operations.
True, modern CPUs still do not have many cores, and the basic structure and purpose of the CPU – the processing of complex calculations – in essence, have not changed. In fact, the CPU is best suited to solve problems involving parsing or interpreting the complex logic contained in the code.
In turn, the graphic processor (GPU) has smaller logical cores, which, however, are much larger (we are talking about arithmetic logic devices (ALU), control elements and cache memory), which are designed with the expectation of parallel processing as a whole identical and relatively simple operations.
The GPU has more arithmetic logic units (ALUs) than a typical CPU, so the ability to process simple operations in parallel has been enhanced
Figure: Compare CPU and GPU ~ Image Source
3. When to use GPU computing
The CPU is better suited for complex linear tasks. Despite the fact that the CPU cores are more powerful, GPUs allow you to more efficiently and quickly process tasks related to AI, machine learning and deep learning. The GPU handles workloads by parallelizing similar operations.
The idea of operations from the point of view of the GPU: take, for example, the operation of finding a word in a document. It can be done sequentially, sorting out one by one all the words in the document, either in parallel, that is, line by line, or with a search for a specific word.
The idea of operations from the point of view of the CPU – there are such operations, for example, the calculation of the Fibonacci series, which cannot be parallelized. After all, you can find the next number only after you calculate the previous two. Such operations are best suited for the CPU.
In turn, operations such as matrix addition and multiplication are easily performed using the GPU, since most of these operations in matrix cells are independent of each other, are similar in nature, and therefore can be parallelized.
4. Briefly about CUDA
CUDA is a parallel computing platform and API model created by Nvidia. Using this API, you can use a processor that supports CUDA graphics for a wide range of calculations. This approach is called GPGPU (non-specialized computing on GPUs). Here they are described in more detail .
Numba is a freeware JIT compiler that translates a subset of Python and NumPy into fast machine code using LLVM, this is done using the llvmlite package in Python. This package offers a number of options for parallelizing Python code for the CPU and GPU, while minimal changes are often enough in the code itself. See more .
Working with the processorNvidia Volta V100 16GB GPU , I used the Numba library.
Architectural Details ~ Streams, Blocks, and Grids
CUDA organizes parallel computing using abstractions such as streams, blocks, and grids.
Stream : A CUDA stream is an assigned chain of instructions coming / current to the CUDA core (in fact, it’s just a pipeline). On the fly, up to 32 threads can exist on the same CUDA core (in this case, all links of this pipeline are filled). This is the execution of the kernel with the given index. Each thread uses its index to access the elements in the array in such a way that the entire set of available threads processes the entire set of data together.
Block: This is a thread group. Not much can be said about the execution of threads within a block: they can be executed either sequentially or in parallel, or without any particular order. You can coordinate threads, for example, using a function
_syncthreads()that forces a thread to stop at a certain point in the kernel and wait until all other threads in the same block also reach the same point.
Grid : This is a group of blocks. There is no synchronization between the blocks.
CUDA: threads, blocks, grids ~ Source
But where exactly are threads, blocks and grids executed? In the case of Nvidia’s G80 GPU architecture, the calculations are apparently distributed as follows:
Grid → GPU: The entire grid is processed by a single GPU processor.
Block → MP: The GPU processor is organized as a collection of multiprocessors, where each multiprocessor is responsible for processing one or more blocks in the grid. One block is never divided between several MPs.
Stream → PP: Each MP is further subdivided into stream processors (PP), and each PP processes one or more threads in a block.
I borrowed some material from this excellently written article . I recommend reading it carefully.
5. A simple program for adding arrays in Python using the GPU
As I said at the very beginning, the essence of this article is to help a wide audience understand the power of the GPU and gain an intuitive understanding of how to use the GPU to solve everyday tasks. Before you start writing code for the GPU, you may have to do some preliminary research. To do this, let’s take an example of a program for adding arrays.
Suppose we have two arrays, ‘a’ and ‘b’ of size ‘n’. We want to generate an array ‘c’ such that each element of array c is the sum of elements with the same indices from the arrays ‘a’ and ‘b’. But in this case, to solve the problem, we will not use sequential calculations, but parallel ones that are done using the GPU.
We will launch n threads / cores. The index under which each specific thread works can be derived from the following formula:
index = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
In the case of a two-dimensional matrix, the index contains two components that mean a row and a column, which can be displayed as follows: We also have to determine the number of threads per block, say, tpb and blocks per grid, say bpg. We will use standard numbers for them. It is important to note another important concept here: when any calculations have to be performed on the GPU, the corresponding data must be transferred to the global memory of the GPU, and the results of the calculations can then be transferred back to the host. These operations are performed using the and functions provided in the Python Numba library.
row = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
col = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
Below are the implementations of this solution for both the CPU and GPU. See both listings for the difference.
Sequential implementation for CPU.
Parallel implementation for GPU
6. Finding the 3 most similar positions for each position in the list using the GPU
Having thoroughly studied theory and practice, we return to the original task: to find the top 3 similar positions for each position in the list using GPU calculations.
In this case, the main idea is that we have n positions, and we will start n threads. Each thread will work in parallel with the others and independently of them, calculating the 3 most similar positions for each position in the list. Each position will be responsible for one thread.
from numba import cuda # Nvidia's GPU Library import numpy as np import math #Выполнение математических вычислений, например, нахождение скалярного произведения, Numpy не поддерживается на GPU import random # Определение функции ядра @cuda.jit('void(float32[:,:],float32[:],int32[:],float32[:],int32[:],float32[:],int32[:])') # CUDA Just-in-time Compiler def cuda_dist(X,first_best_val,first_best_index,second_best_val,second_best_index,third_best_val,third_best_index): """Поток будет выполнять эту функцию ядра.""" # Отображение потока на строку row = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x; if ((row >= n)): # Принимать потоки только в границах матриц return first_best_val[row] = SMALL # косинусное сходство с первой по схожести позицией first_best_index[row] = -1 # индекс первой по схожести позиции будет храниться здесь second_best_val[row] = SMALL # косинусное сходство со второй по схожести позицией second_best_index[row] = -1 # индекс второй по схожести позиции будет храниться здесь third_best_val[row] = SMALL # косинусное сходство с третьей по схожести позицией third_best_index[row] = -1 # индекс третьей по схожести позиции будет храниться здесь for i in range(n): if(i==row): # Тот же элемент continue # Переменные для вычисления скалярного произведения, так как мы не можем использовать операции numpy в контексте GPU tmp = 0.0 magnitude1 = 0.0 magnitude2 = 0.0 for j in range(k): tmp += X[row,j] * X[i,j] magnitude1 += (X[row,j]* X[row,j]) magnitude2 += (X[i,j]* X[i,j]) tmp /= (math.sqrt(magnitude1)*math.sqrt(magnitude2)) # Значение скалярного произведения Dot_product(a,b) = a.b/(|a|*|b|) if(tmp>=first_best_val[row]): third_best_val[row] = second_best_val[row] third_best_index[row] = second_best_index[row] second_best_val[row] = first_best_val[row] second_best_index[row] = first_best_index[row] first_best_val[row] = tmp first_best_index[row] = i elif(tmp>=second_best_val[row]): third_best_val[row] = second_best_val[row] third_best_index[row] = second_best_index[row] second_best_val[row] = tmp second_best_index[row] = i elif(tmp>third_best_val[row]): third_best_val[row] = tmp third_best_index[row] = i # Подробности об устройстве device = cuda.get_current_device() d_x = cuda.to_device(X) d_first_val = cuda.device_array_like(first_val) d_first_index = cuda.device_array_like(first_index) d_second_val = cuda.device_array_like(second_val) d_second_index = cuda.device_array_like(second_index) d_third_val = cuda.device_array_like(third_val) d_third_index = cuda.device_array_like(third_index) tpb = device.WARP_SIZE # blocksize или количество потоков на блок bpg = int(np.ceil((n)/tpb)) # блоков на грид %time cuda_dist[bpg,tpb](d_x,d_first_val,d_first_index,d_second_val,d_second_index,d_third_val,d_third_index) # вызов ядра # Перенос вывода с устройства на хост first_val = d_first_val.copy_to_host() print (first_val[:10]) # Первые 10 значений # Перенос вывода с устройства на хост first_index = d_first_index.copy_to_host() print (first_index[:10]) # Первые 10 индексов
Implementation for the GPU ~ Code for finding the 3 positions most similar to the given
spent The total time spent by the GPU to find the top 3 similar positions for each position in the list was 481 ms (0.5 seconds). It took another 20 seconds to copy data from the device to the host and from the host to the device.
The task, the solution of which on the CPU would take about 2 days, on the GPU was solved in 20.5 seconds. This was possible only due to the nature of the task. The search for the 3 most similar positions for ‘A’ does not depend on the search for the 3 most similar positions for ‘B’. We took advantage of this fact and applied the parallelism provided by the GPU to speed up the process. The example also illustrates what types of tasks are most conveniently solved with the help of a powerful GPU.
This study was performed on Walmart Labs MLP (machine learning platform) . Thanks to Ayush Kumar for helping streamline the workflow.