Day: June 6, 2020
Today we touch on a most relevant topic: Python for working with the GPU. The author considers an example, trivial in its monstrosity, and demonstrates the solution, accompanying it with extensive listings. Enjoy the read!
Few of us, in one form or another, have escaped the recent hype around GPU computing. Before you read further, a disclaimer: I am not a GPU expert; my journey into the GPU world is just beginning. But this technology has become so powerful that, armed with it, you can solve a whole class of problems. I was assigned a task at work on which the machine spent hours with no visible progress. As soon as I turned to the GPU, the problem was being solved in seconds. A task that would have taken roughly 2 days completed in just 20 seconds.
In the following sections I will describe this task in detail. We will also discuss how and when to use the GPU for problems of this kind. So read carefully; believe me, you will not regret it. First we will dig into the details of the task, then get comfortable with the GPU and, finally, use it to solve this problem. I will use the Python Numba library and an Nvidia Volta V100 16GB GPU.
In the retail sector, you often have to look for similar or nearest objects. I was given a list of positions, each represented by k latent attributes, and was asked to find the top-3 positions most similar to each position in the list. Cosine similarity was chosen as the similarity metric for this problem. This is what my data looked like.
List of data items with 64 latent characteristics
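Cosine similarity, the metric chosen above, can be computed for two attribute vectors with a few lines of NumPy. This is a minimal helper sketch for illustration, not code from the original article:

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(a, b) = (a . b) / (|a| * |b|)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Identical vectors give 1.0, orthogonal vectors give 0.0, which is why the search below simply maximizes this value.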
I was given a list of about 10⁵ positions. Finding the 3 most similar positions for any one of them requires checking the cosine similarity against every other element in the list. That amounts to n * k operations, where n is the number of positions and k is the number of attributes per position: we need the dot product of the given position with each of the remaining positions in the list.
```python
import numpy as np

def top_3_similar_items(X, mindex):
    # X is the list of positions; we need to find the top-3 positions
    # most similar to the position with index 'mindex'.
    # n (the total number of positions) is assumed to be defined globally.
    SMALL = -9999.0  # just for initialization, since we maximize cosine similarity
    first_best_val = SMALL    # cosine similarity of the most similar position
    first_best_index = -1     # index of the most similar position
    second_best_val = SMALL   # cosine similarity of the second most similar position
    second_best_index = -1    # index of the second most similar position
    third_best_val = SMALL    # cosine similarity of the third most similar position
    third_best_index = -1     # index of the third most similar position
    for j in range(n):  # n positions in the list in total
        if mindex == j:
            continue
        # cosine similarity: dot(a, b) / (|a| * |b|)
        tmp = np.dot(X[mindex], X[j]) / (np.sqrt(np.sum(np.square(X[mindex])))
                                         * np.sqrt(np.sum(np.square(X[j]))))
        if tmp >= first_best_val:
            third_best_val = second_best_val
            third_best_index = second_best_index
            second_best_val = first_best_val
            second_best_index = first_best_index
            first_best_val = tmp
            first_best_index = j
        elif tmp >= second_best_val:
            third_best_val = second_best_val
            third_best_index = second_best_index
            second_best_val = tmp
            second_best_index = j
        elif tmp > third_best_val:
            third_best_val = tmp
            third_best_index = j
    # return all six values
    return (first_best_val, first_best_index, second_best_val,
            second_best_index, third_best_val, third_best_index)
```
Code for finding the three positions most similar to a given position
Now, to find the top-3 similar positions for all positions in the list, the complexity is multiplied by another n. The final complexity is O(n * n * k) = O(n²k).
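The same O(n²k) computation can be expressed with NumPy matrix operations. This is a hedged vectorized CPU sketch (a hypothetical helper, not the article's code): normalizing the rows turns the matrix product into a full cosine-similarity matrix, from which the top-3 indices per row can be read off.

```python
import numpy as np

def top3_similar_vectorized(X):
    # Normalize rows so that a plain dot product equals cosine similarity
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / norms
    sim = Xn @ Xn.T                    # n x n cosine-similarity matrix
    np.fill_diagonal(sim, -np.inf)     # exclude each position itself
    # Indices of the top-3 most similar positions per row, best first
    return np.argsort(sim, axis=1)[:, ::-1][:, :3]
```

The complexity is unchanged, but the loops are pushed into optimized BLAS routines, which is what the author means below by "well optimized using Numpy operations".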
Code for finding the three most similar positions for each position in the list
Test run and time evaluation
I ran the code to find the 3 most similar positions on a subset of n = 10³ positions with k = 64. This task took Python about 17 seconds, at an average rate of 3.7 * 10⁶ operations per second. The code was well optimized using NumPy operations and arrays. Note that all of these operations run sequentially on the CPU.
Run for n = 10³ positions
Conclusion: the time it took for n = 10³ positions
Next, I increased the test subset to n = 10⁴ positions. Since the complexity is O(n²k), the execution time grew 100-fold (n grew 10-fold). The code took 1700 seconds = 28.33 minutes to complete.
Conclusion: the time it took to process n = 10⁴ positions
Now we come to the most important part: estimating the time needed to process the full list of 10⁵ positions. Counting again, the running time grows by another factor of 100, since the time complexity of the algorithm is O(n²k).
Estimated time = 1700 * 100 seconds = 2834 minutes = 47.2 hours ~ 2 days.
Oh my God! So long!!!
You have probably guessed by now that I did, in fact, manage to do all of this very quickly using the power of the GPU. The time gain from using a GPU is simply shocking. I will save the exact numbers for dessert; for now, let's get acquainted with the world of the GPU.
The Central Processing Unit (CPU) is, in essence, the brain of any computing device: it executes the instructions written in a program, performing control and logical operations as well as input/output.
True, modern CPUs still have relatively few cores, and the basic structure and purpose of the CPU, the processing of complex computations, has essentially not changed. The CPU is best suited to problems that involve parsing or interpreting the complex logic contained in code.
The graphics processing unit (GPU), in turn, has smaller logical cores, but far more of them (we are talking about arithmetic logic units (ALUs), control units and caches), designed for parallel processing of large numbers of identical and relatively simple operations.
The GPU has more arithmetic logic units (ALUs) than a typical CPU, so the ability to process simple operations in parallel has been enhanced
Figure: Compare CPU and GPU ~ Image Source
The CPU is better suited for complex linear tasks. Although individual CPU cores are more powerful, GPUs process workloads related to AI, machine learning and deep learning faster and more efficiently, handling them by parallelizing similar operations.
An operation from the GPU's point of view: take, for example, searching for a word in a document. It can be done sequentially, going through all the words in the document one by one, or in parallel, line by line, with every line searched for the word at the same time.
An operation from the CPU's point of view: some computations, such as the Fibonacci series, cannot be parallelized at all. After all, you can find the next number only after you have computed the previous two. Such operations are best suited for the CPU.
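The Fibonacci example can be made concrete with a few lines of Python (a toy illustration, not from the original article): each iteration depends on the result of the previous one, so the loop is inherently sequential.

```python
def fib(n):
    # fib(0) = 0, fib(1) = 1; each step needs the two previous values,
    # so the iterations cannot run in parallel
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a
```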
In turn, operations such as matrix addition and multiplication are easily performed using the GPU, since most of these operations in matrix cells are independent of each other, are similar in nature, and therefore can be parallelized.
CUDA is a parallel computing platform and API model created by Nvidia. Using this API, you can harness a CUDA-capable graphics processor for a wide range of computations. This approach is called GPGPU (general-purpose computing on GPUs). It is described here in more detail.
Numba is an open-source JIT compiler that translates a subset of Python and NumPy into fast machine code using LLVM, via the llvmlite Python package. It offers a number of options for parallelizing Python code for the CPU and GPU, often with only minimal changes to the code itself. See more.
Working with the Nvidia Volta V100 16GB GPU, I used the Numba library.
Architectural Details ~ Streams, Blocks, and Grids
CUDA organizes parallel computation using abstractions called threads, blocks, and grids.
Thread: a CUDA thread is a chain of instructions assigned to and flowing through a CUDA core (in effect, a pipeline). Up to 32 threads can be in flight on the same CUDA core at once (in which case every stage of the pipeline is filled). A thread is an execution of the kernel with a given index; each thread uses its index to access elements of the array, so that the full set of available threads processes the entire data set together.
Block: a group of threads. Not much can be guaranteed about the execution of threads within a block: they may run sequentially, in parallel, or in no particular order. Threads can be coordinated, for example, with the __syncthreads() barrier (cuda.syncthreads() in Numba), which forces a thread to stop at a certain point in the kernel and wait until all other threads in the same block reach the same point.
Grid : This is a group of blocks. There is no synchronization between the blocks.
CUDA: threads, blocks, grids ~ Source
But where exactly are threads, blocks and grids executed? In the case of Nvidia’s G80 GPU architecture, the calculations are apparently distributed as follows:
Grid → GPU: The entire grid is processed by a single GPU processor.
Block → MP: The GPU is organized as a collection of multiprocessors (MPs), where each multiprocessor is responsible for processing one or more blocks of the grid. A block is never split across several MPs.
Thread → SP: Each MP is further subdivided into stream processors (SPs), and each SP processes one or more threads of a block.
I borrowed some material from this excellently written article . I recommend reading it carefully.
As I said at the very beginning, the point of this article is to help a wide audience appreciate the power of the GPU and gain an intuition for using it to solve everyday tasks. Before you start writing GPU code, some groundwork is needed. For that, let's take the example of a program that adds two arrays.
Suppose we have two arrays, ‘a’ and ‘b’, of size ‘n’. We want to produce an array ‘c’ in which each element is the sum of the elements at the same index in ‘a’ and ‘b’. But this time, instead of computing sequentially, we will compute in parallel on the GPU.
We will launch n threads, one per element. The index that each particular thread works on is derived from the following formula:
index = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
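This formula can be checked with plain Python arithmetic, no GPU required (a toy emulation for illustration): with bpg blocks of tpb threads each, the global indices 0 through bpg * tpb - 1 are each produced exactly once.

```python
def global_index(block_idx, block_dim, thread_idx):
    # Mirrors: index = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
    return block_idx * block_dim + thread_idx

tpb = 32  # threads per block
bpg = 4   # blocks per grid
# Every (block, thread) pair maps to a distinct global index
indices = [global_index(b, tpb, t) for b in range(bpg) for t in range(tpb)]
```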
In the case of a two-dimensional matrix, the index has two components, a row and a column, computed as shown below. We also have to choose the number of threads per block, say tpb, and blocks per grid, say bpg; we will use standard values for them. One more important concept: whenever a computation is to be performed on the GPU, the corresponding data must first be transferred to the GPU's global memory, and the results of the computation then transferred back to the host. In Numba, these transfers are done with the cuda.to_device() and copy_to_host() functions.
row = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
col = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.y
Below are the implementations of this solution for both the CPU and GPU. See both listings for the difference.
Sequential implementation for CPU.
Parallel implementation for GPU
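The two listings above were published as images; here is a hedged reconstruction of their essence. The CPU version loops over the elements sequentially. The GPU version is shown as a pure-Python emulation of the kernel launch: on real hardware the kernel body would be decorated with Numba's @cuda.jit and the two outer loops below would be replaced by the GPU scheduler running the threads in parallel.

```python
import numpy as np

def add_cpu(a, b):
    # Sequential CPU version: one loop over all elements
    c = np.empty_like(a)
    for i in range(len(a)):
        c[i] = a[i] + b[i]
    return c

def add_kernel(block_idx, thread_idx, block_dim, a, b, c):
    # Body of a CUDA-style kernel (with Numba this would be @cuda.jit-decorated)
    i = block_idx * block_dim + thread_idx
    if i < len(a):  # guard threads that fall outside the array bounds
        c[i] = a[i] + b[i]

def add_gpu_emulated(a, b, tpb=32):
    # Emulate the launch add_kernel[bpg, tpb](...) with plain loops
    c = np.empty_like(a)
    bpg = int(np.ceil(len(a) / tpb))  # blocks per grid
    for block in range(bpg):
        for thread in range(tpb):     # on a real GPU these run in parallel
            add_kernel(block, thread, tpb, a, b, c)
    return c
```

The guard `i < len(a)` matters because the grid launches bpg * tpb threads, which may exceed n when n is not a multiple of tpb; the same guard appears in the real kernel later in this article.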
Having worked through the theory and practice, we return to the original task: finding the top-3 similar positions for each position in the list using GPU computation.
The main idea here is that, with n positions, we will launch n threads. Each thread works in parallel with the others and independently of them, computing the 3 most similar positions for its own position: one thread is responsible for each position.
```python
from numba import cuda  # Nvidia's GPU library
import numpy as np
import math    # math works inside kernels; NumPy operations are not supported on the GPU
import random

# Kernel definition; n, k, SMALL, X and the output arrays are defined globally
@cuda.jit('void(float32[:,:],float32[:],int32[:],float32[:],int32[:],float32[:],int32[:])')  # CUDA just-in-time compiler
def cuda_dist(X, first_best_val, first_best_index, second_best_val,
              second_best_index, third_best_val, third_best_index):
    """Each thread executes this kernel function."""
    # Map the thread to a row
    row = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
    if row >= n:  # only accept threads within the bounds of the matrix
        return
    first_best_val[row] = SMALL   # cosine similarity of the most similar position
    first_best_index[row] = -1    # index of the most similar position
    second_best_val[row] = SMALL  # cosine similarity of the second most similar position
    second_best_index[row] = -1   # index of the second most similar position
    third_best_val[row] = SMALL   # cosine similarity of the third most similar position
    third_best_index[row] = -1    # index of the third most similar position
    for i in range(n):
        if i == row:  # same position
            continue
        # accumulators for the dot product, since NumPy is unavailable inside the kernel
        tmp = 0.0
        magnitude1 = 0.0
        magnitude2 = 0.0
        for j in range(k):
            tmp += X[row, j] * X[i, j]
            magnitude1 += X[row, j] * X[row, j]
            magnitude2 += X[i, j] * X[i, j]
        tmp /= (math.sqrt(magnitude1) * math.sqrt(magnitude2))  # cosine similarity = a.b / (|a| * |b|)
        if tmp >= first_best_val[row]:
            third_best_val[row] = second_best_val[row]
            third_best_index[row] = second_best_index[row]
            second_best_val[row] = first_best_val[row]
            second_best_index[row] = first_best_index[row]
            first_best_val[row] = tmp
            first_best_index[row] = i
        elif tmp >= second_best_val[row]:
            third_best_val[row] = second_best_val[row]
            third_best_index[row] = second_best_index[row]
            second_best_val[row] = tmp
            second_best_index[row] = i
        elif tmp > third_best_val[row]:
            third_best_val[row] = tmp
            third_best_index[row] = i

# Device details
device = cuda.get_current_device()

# Transfer the input and output arrays to the device
d_x = cuda.to_device(X)
d_first_val = cuda.device_array_like(first_val)
d_first_index = cuda.device_array_like(first_index)
d_second_val = cuda.device_array_like(second_val)
d_second_index = cuda.device_array_like(second_index)
d_third_val = cuda.device_array_like(third_val)
d_third_index = cuda.device_array_like(third_index)

tpb = device.WARP_SIZE        # block size, i.e. threads per block
bpg = int(np.ceil(n / tpb))   # blocks per grid

# Kernel call (timed with IPython's %time magic)
%time cuda_dist[bpg, tpb](d_x, d_first_val, d_first_index, d_second_val, d_second_index, d_third_val, d_third_index)

# Copy the output from device to host
first_val = d_first_val.copy_to_host()
print(first_val[:10])    # first 10 values
first_index = d_first_index.copy_to_host()
print(first_index[:10])  # first 10 indices
```
Implementation for the GPU ~ Code for finding the 3 positions most similar to the given
The total time the GPU spent finding the top-3 similar positions for every position in the list was 481 ms (about 0.5 seconds). Copying the data from host to device and back took another 20 seconds.
A task whose solution on the CPU would take about 2 days was solved on the GPU in 20.5 seconds. This was possible only because of the nature of the task: finding the 3 positions most similar to ‘A’ does not depend on finding the 3 positions most similar to ‘B’. We took advantage of this fact and applied the parallelism provided by the GPU to speed up the process. The example also illustrates which kinds of tasks are most conveniently solved with a powerful GPU.
This study was performed on Walmart Labs MLP (machine learning platform) . Thanks to Ayush Kumar for helping streamline the workflow.
Hello, Habr! Today I decided to publish on the company’s blog a post that I wrote one on one with my laptop. No one knew about it, no one read it or signed off on it. In a time that is hard (frankly, disgusting) for everyone, a time of self-isolation and shrinking business activity, I suddenly want to talk about CRM, not as an engineer and not as a marketer, but as a person who first encountered a CRM system back in 2009, as a novice analyst. Back then, a third of today’s CRM developers did not even exist. In 11 years the world of CRM systems has changed incredibly, and so have we. Yet automation in small and medium-sized businesses is still scarce. And that, my friends, is sad.
Pictures like this don’t make me laugh; they make me sad.
2008 was my first year working in business. I came to a large interregional company (no joke: 7 branches, more than 500 employees) from the world of science and education, with an excellent analytical background, a glib tongue, a graduate degree and… a complete, total misunderstanding of business. I started work on June 19 and had a whole summer to immerse myself in business topics before the post-vacation load on the company’s services began. If you remember 2008, the autumn turned out badly: the country plunged into a large-scale systemic economic crisis. I well remember how our salary cards were reissued at several banks, as one went bankrupt and others suffered constant interruptions in transactions. In short, it was very hard: the customer base was shrinking, people stopped using the company’s services, corporate clients were leaving. There were layoffs. I stayed, by some miracle I can’t explain.
And in the midst of all this crisis, the commercial management concluded that we needed a CRM system. I was included in the implementation working group on behalf of the sales and partner services. It was an incredible responsibility and hard analytical work; it was during that period I learned that after 15 hours of gathering requirements and analyzing business processes, it is easier not to go home but to sleep on an office sofa, catch a few hours before dawn, and get back to work. So, we set out to choose a CRM.
Together with colleagues from the IT department, I personally tested SAP, Salesforce, Oracle Siebel CRM (I rooted for it, since I had worked with Oracle), and even CBOSS as a more or less suitable Russian-language alternative. Polished, smart consultants from these vendors came to our office; initially the budget was about 40 million rubles. If you think that was pocket change for our big company, let me inform you that it was 11.5% of gross revenue (not profit!). That is pretty weighty. I won’t bore you with stories about gathering requirements; I’ll walk through the main problems milestone by milestone.
It felt like a feast in a time of plague: our corporate clients were closing down while we were rolling out a huge, complex piece of software. By the beginning of autumn 2009 our system was filled with data, could build segments and run SMS campaigns against them, supported OLAP, and was being integrated with the accounting system and the document-workflow system (both Russian, with their own quirks).
So, by the start of the school year and the end of the vacation season, we began working in the CRM. In the first month there was no effect; by the start of the second there was despair and a feeling of guilt: after all, I was responsible for CRM at the very front line, in the sales service. At the end of October we saw revenue growth, expansion of the ancillary-services segment, and sales growth. We adjusted for organic growth. The growth in November and December could, of course, be attributed to the New Year rush for everything, including our services. But by March we knew for sure: the CRM was working, and revenue growth was almost back to pre-crisis levels. The new system paid for itself thanks to its excellent analytics, very fast handling of inquiries, and strong individualized offers. We got through the crisis faster than even our very large competitors, because the company truly worked with its client base; we managed it!
The rest of the story is less cheerful: the company changed owners, the CRM was replaced with the investor’s money, and our well-tuned money-making machine vanished into oblivion along with the team. We all left.
I went to a company that, among other fine software, developed a simple and pretty CRM and sold it all over the world (minus the CIS, because the principles of CRM built for export do not get along with Russian business reality). A few lines of my observations follow.
And the company itself, using its own CRM for mailings and calls, sold an extra couple of tens of thousands of dollars’ worth of implementation services and software every month. When I met the crisis of 2014 at that company, the only thing the team did was mine the internal CRM for opportunities, relentlessly. There was an effect, and the losses were minimal.
Even then, I knew a lot about the CRM market in Russia and abroad, but I had never worked for a company where CRM was the core business. I wanted to. That is how RegionSoft and our RegionSoft CRM appeared in my life, quite suddenly. Real, professional, hardcore CRM: everything I wanted.
The attitude of Russian small and medium businesses toward CRM systems is peculiar. They find CRM interesting, but several important factors shape that attitude, and they are most visible from the inside.
For a long time I tried to answer the question: is implementing a CRM expensive for a company? My conclusion: if a small or medium-sized business is implementing it and has standard requirements, then no, it is not expensive. It is an investment that pays off; the payback period depends on the intensity of use and the field of activity, and the payback comes first of all not from profit growth but from cost reduction and redistribution of labor. It is great if you can spend a bit more time than usual on implementation and training: things will take off. Now is exactly the time to choose, to understand, to review processes, to train staff, and to hit the gas as soon as business activity starts to revive.
But again, I repeat: this is my personal opinion and my experience. I believe in the CRM system as a tool and as the foundation of a modern company. And so far it has not let me down.
Working at a Russian CRM developer is a serious challenge because, as I said, the attitude toward CRM in Russia is very different from the Western one. It is an interesting, responsible, difficult road. For example, we decided not only to implement CRM but also to share everything we know about it, for free, so that businesses can read the articles, find answers to their questions, and search for their ideal CRM (I soberly realize that could be any solution). We chose Habr for this; by the way, today marks 4 years minus one day since we came to Habr, and this is our 121st article. The articles are written exclusively by our team; we have no outside copywriters, editors, or other outsourcers.
We develop, implement and support the product ourselves, provide technical support ourselves, and write the articles ourselves; we belong to no one, and so we answer for our own work.
After 6 years of this work, I know how CRM can change a company, because I have seen many implementations that made companies’ lives easier and better. And every time I hear a review or see our CRM recommended, I feel pride and confidence in our decision and in our team.
There is a long road ahead: interesting solutions for business and the development of the CRM system itself. I know that customer requirements, feedback from real use, and current market demand will always come first. And CRM will keep helping customers make their businesses advanced.
Today RegionSoft is 19 years old, and for 14 of those years the company has been developing its own full CRM system based on customer requirements. The team works for small and medium-sized businesses: so that everything runs fast, routine is defeated, and interaction with the client base is reliable and safe. We are there for our customers at any time. Because in 2020, automation decides. And going forward, it will only get harder without it; that is an obvious trend.