# Down with loops, or Lazy composition of algorithms in C++

“Let whoever has never made an indexing mistake in a loop be the first to throw an exception in a destructor.” — Ancient wisdom

Loops are terrible. Loops are hard to read: instead of immediately grasping the author’s intent, you first have to dig into the code to understand *what exactly* it does. In a loop it is easy to get the indexing wrong, or to shadow the loop index in a nested loop. Loops are hard to maintain: hard to fix when bugs appear, hard to change as requirements evolve, and so on.

In the end, it’s just plain ugly.

Programmers have long tried to simplify the writing of loops. First they noticed frequently repeated loops and extracted them into separate functions. Then came lazy iterators, and later ranges. Each of these ideas was a breakthrough. Yet the ideal has still not been reached, and people keep looking for ways to improve their code.

This article aims to shed light on an idea that is not at all new, but so far not widespread, and that is quite capable of producing another breakthrough in the way C++ programs are written.

So how do you write beautiful, understandable, efficient code, and also parallelize large computations with a flick of the fingers on the keyboard?

## Contents

- Existing models
- Basic concepts
- Ideology
- The code
- Multithreading
- Comparison table

## Existing models

The main ways to avoid writing loops today are the algorithms from the standard library, and the lazy iterators and ranges from the Boost.Iterator, Boost.Range and range-v3 libraries.

range-v3 has partially made it into the C++20 standard library, but, firstly, it got there in a fairly truncated form, and secondly, implementations were not yet available at the time of writing.

Standard algorithms are beautiful and help get rid of loops, but, unfortunately, only in the simplest cases, because several such algorithms cannot be combined into a single computation: the intermediate result of each step has to be stored somewhere. That means wasted memory, plus difficulties with inferring the types of the intermediate results, in other words, more complex code.

It is precisely for this reason that lazy iterators and ranges appeared in third-party libraries, and C++17 gained hybrid algorithms of the `std::transform_reduce` family.

Lazy iterators and ranges solve many problems, but they are not without problems of their own. In particular, since they are separated from the computation scheme (they define only operations on individual elements of a sequence), they are hard to parallelize. Standard algorithms, on the other hand, have had parallel versions since C++17, which can make better use of multi-core architectures.

The question arises: is it possible to combine the advantages of both approaches? It turns out that it is. This is what the rest of the article is about.

## Basic concepts

To move further, we need to understand what a *fold* is.

### Definition No. 1: fold

A *fold* is a computational process performed over a sequence of values according to a rule defined by the fold kernel.

The result of the fold is the value obtained by sequentially applying the kernel to the current value and the next element of the sequence.

### Definition No. 2: fold kernel

The *fold kernel* is the action performed at each step of the fold. It is applied to the current value of the fold and the next element of the sequence.

(Figure: a fold of a sequence with a kernel and an initial value.)

In the standard library, the fold is represented by `std::accumulate` and `std::reduce`.

## Ideology

To understand the main idea of this approach, we need to note several well-known facts.

### Fact No. 1: every loop can be represented as a fold

Indeed:

- The program context before the loop starts is the initial value;
- An index set, container, range, etc. is the sequence of elements;
- A loop iteration is the application of a two-place operation (the fold kernel) to the current value and the next element of the sequence, changing the current value.

```
auto v = 0;                   // Initial value: v_0
for (auto i = 0; i < 10; ++i) // Sequence: [x_0, x_1, ...]
{
    v = f(v, i);              // Two-place operation changing the
                              // value: v_{i + 1} = f(v_i, x_i)
}
```

In other words, a basis consisting of a single operation, the fold, is enough to express any loop. All other operations, such as the standard algorithms, can be expressed through it.

#### Example No. 1: transform via a fold

```
template <ForwardIterator I, OutputIterator J, UnaryFunction F>
J transform (I begin, I end, J result, F f)
{
    // The initial value is the output iterator.
    auto initial_value = result;
    // The fold kernel.
    auto binary_op =
        [f] (auto iterator, auto next_element)
        {
            // Write the result of the mapping to the current iterator...
            *iterator = f(next_element);
            // ...and return the advanced iterator.
            return ++iterator;
        };
    // The fold.
    return accumulate(begin, end, initial_value, binary_op);
}
```

#### Example No. 2: filtering via a fold

```
template <ForwardIterator I, OutputIterator J, UnaryPredicate P>
J copy_if (I begin, I end, J result, P p)
{
    // The initial value.
    auto initial_value = result;
    // The fold kernel.
    auto binary_op =
        [p] (auto iterator, auto next_element)
        {
            if (p(next_element))
            {
                *iterator = next_element;
                ++iterator;
            }
            return iterator;
        };
    // The fold.
    return accumulate(begin, end, initial_value, binary_op);
}
```

All other sequential algorithms are expressed in a similar way. The inquisitive reader may do this as an exercise.

### Fact No. 2: most loops decompose into simple components

If you look closely, it becomes clear that most loops are typical. They decompose into simple components:

- Transformation;
- Filtering;
- Grouping;
- Counting;
- Summation;
- Writing to an array;
- etc.

This means we need to pick a basis of operations expressive enough that its combinations cover the vast majority of possible loops, and then learn how to compose those combinations easily and conveniently in program code.

### Fact No. 3: every fold can be represented as an automaton

By definition, an automaton is a system that can reside in various states, with transitions between those states occurring when a certain input acts on the system.

So, if we view the fold as an automaton, then the states of the automaton are the set of possible values of the accumulated variable, and the input is the application of the fold kernel to the current value of that variable and the next element of the sequence.

**Important:** this model uses a generalization of automata in which there are not only input symbols, under whose influence transitions between states occur, but also output symbols accompanying each transition. In the diagrams, the input symbol is drawn above the transition arrow and the output symbol below it. In addition, our automaton may have memory.

#### Example No. 1: the transform automaton

For example, here is what the automaton for a transformation (`transform`, or `map` in functional programming) looks like.

(Figure: the transform automaton; on input x it emits the output f(x).)

This automaton has one state and one transition.

#### Example No. 2: the filtering automaton

(Figure: the filtering automaton.)

This automaton has one state and two transitions. One transition is taken when the input symbol satisfies the predicate (and the element is emitted); the other when it does not (and nothing is emitted).

### Fact No. 4: automata compose

If an automaton has an output, then obviously that output can be fed to the input of another automaton.

Thus, with a set of several automata, each defining one transformation operation, quite complex transformations can be composed.

### Back to the fold

To obtain the fold we need, we put at the end of the chain an automaton that is the fold kernel.

Next, note that all the automata except the last one effectively prepare data for it, so we can mentally collapse them all into the last automaton. The result is a fold kernel, and that kernel is exactly the loop body we wanted to write.

So, we have decomposed the loop into simple components and expressed it as a fold. In theory everything is fine, but what does it look like in code?

## The code

Based on the ideas above, the Proxima library was developed.

### Simple example

```
#include <proxima/compose.hpp>
#include <proxima/kernel/sum.hpp>
#include <proxima/reduce.hpp>
#include <proxima/transducer/stride.hpp>
#include <proxima/transducer/take_while.hpp>
#include <proxima/transducer/transform.hpp>

#include <cassert>

int main ()
{
    const int items[] = {1, 2, 3, 4, 5};
    const auto kernel =
        proxima::compose
        (
            proxima::transform([] (auto x) {return x * x;}),   // 1. Every element is squared;
            proxima::stride(2),                                // 2. Only elements whose indices are
                                                               //    multiples of two are taken
                                                               //    (numbering from zero);
            proxima::take_while([] (auto x) {return x < 10;}), // 3. Elements are taken while they
                                                               //    are less than ten;
            proxima::sum                                       // 4. The result is summed.
        );
    const auto x = proxima::reduce(items, kernel);
    assert(x == 10); // 1 * 1 + 3 * 3
}
```

### constexpr

Note that the code from the example can be executed at compile time:

```
#include <proxima/compose.hpp>
#include <proxima/kernel/sum.hpp>
#include <proxima/reduce.hpp>
#include <proxima/transducer/stride.hpp>
#include <proxima/transducer/take_while.hpp>
#include <proxima/transducer/transform.hpp>

int main ()
{
    constexpr int items[] = {1, 2, 3, 4, 5};
    constexpr auto kernel =
        proxima::compose
        (
            proxima::transform([] (auto x) {return x * x;}),   // 1. Every element is squared;
            proxima::stride(2),                                // 2. Only elements whose indices are
                                                               //    multiples of two are taken
                                                               //    (numbering from zero);
            proxima::take_while([] (auto x) {return x < 10;}), // 3. Elements are taken while they
                                                               //    are less than ten;
            proxima::sum                                       // 4. The result is summed.
        );
    constexpr auto x = proxima::reduce(items, kernel);
    static_assert(x == 10); // 1 * 1 + 3 * 3
}
```

Most of Proxima can be executed at compile time.

## Multithreading

One of the key features of the described model is that it is easy to parallelize.

Proxima has a mechanism that makes parallelizing computations very easy: the dummy transducer `pipe`, which acts as a "thread splitter":

```
proxima::reduce(values,
    proxima::compose
    (
        proxima::for_each(hard_work), // | Thread #1
        //    ----------
        proxima::pipe,                // Thread splitter
        //    ----------
        proxima::for_each(hard_work), // | Thread #2
        //    ----------
        proxima::pipe,                // Thread splitter
        //    ----------
        proxima::for_each(hard_work), // | Thread #3
        proxima::sum                  // | Thread #3
    ));
```

The code above means that three threads will be created, and each of them performs only its part of the work on each element of the sequence.

To show the effectiveness of such a partition, consider an example (the complete code is available on GitLab).

In it we measure the difference between a fold parallelized across three threads, an ordinary fold, and a plain loop. To simulate "heavy" computations we use a function that simply sleeps for a few microseconds, and we generate a set of random numbers that determine the sleep durations.

```
void hard_work (std::int32_t time_to_sleep)
{
    std::this_thread::sleep_for(std::chrono::microseconds(time_to_sleep));
}

const auto proxima_crunch_parallel =
    [] (auto b, auto e)
    {
        return
            proxima::reduce(b, e,
                proxima::compose
                (
                    proxima::for_each(hard_work),
                    proxima::pipe,
                    proxima::for_each(hard_work),
                    proxima::pipe,
                    proxima::for_each(hard_work),
                    proxima::sum
                ));
    };

const auto proxima_crunch =
    [] (auto b, auto e)
    {
        return
            proxima::reduce(b, e,
                proxima::compose
                (
                    proxima::for_each(hard_work),
                    proxima::for_each(hard_work),
                    proxima::for_each(hard_work),
                    proxima::sum
                ));
    };

const auto loop_crunch =
    [] (auto b, auto e)
    {
        auto sum = typename decltype(b)::value_type{0};
        while (b != e)
        {
            hard_work(*b);
            hard_work(*b);
            hard_work(*b);
            sum += *b;
            ++b;
        }
        return sum;
    };
```

Generating 1000 random sleep durations in the range from 10 to 20 microseconds gives the following picture (the running time of the corresponding routine is shown; less is better):

```
proxima_crunch_parallel | 0.0403945
proxima_crunch | 0.100419
loop_crunch | 0.103092
```

And the "heavier" the computational functions, the greater the lead of the multi-threaded version. For example, with random sleeps in the range from 100 to 200 microseconds the picture is as follows:

```
proxima_crunch_parallel | 0.213352
proxima_crunch | 0.624727
loop_crunch | 0.625393
```

That is, almost exactly three times faster, just as it would be with an ideal split into three threads.

## Comparison table

| Library | STL (algorithms) | Boost | range-v3 | Proxima |
|---|---|---|---|---|
| Composability | No | Yes | Yes | Yes |
| Type inference | Poor | Medium | Medium | Good |
| Parallelization | Almost* | No | No | Yes |
| Compatibility | Boost | STL | STL | All |
| Extensibility | Hard | Fine | Hard | Easy |
| Independence | Yes | Yes | Yes | Not quite |
| constexpr | Partially | No | Partially** | Yes*** |
| Model | Monolithic | Lazy | Lazy | Lazy |

\*) Parallel versions are not yet implemented for all STL algorithms.

\*\*) constexpr ranges will likely get better once they land in the STL.

\*\*\*) Proxima's constexpr support depends on the STL: everything that does not depend on the STL is already constexpr; everything that does will become constexpr as soon as the corresponding parts of the STL do.
