# How to speed up the game “Life” a hundred times

It is difficult to find a person who is not familiar with the game ” Life “, invented by the English mathematician John Conway back in 1970, and still not losing its popularity. Many programmers wrote their implementation of this game, and another is unlikely to surprise anyone. However, this game is a great example showing how useful optimization of calculations can be, even without changing the asymptotic complexity of the algorithm. We will start with the simplest implementation in c # and will consistently apply various optimizations, speeding up the program.

We will also improve the algorithm in Javascript, speeding it up 10 times compared to the naive implementation.

At the end of the article, a link is given to the code, as well as to the online implementation of the game with an optimized JavaScript algorithm, performing up to two hundred iterations per second on a field of size 1920×1080 (Full HD), where you can ~~kill time~~ playing this wonderful game.

### Rules of the game

The game “Life” is on a plane consisting of cells, each cell has 8 neighbors. Cells can be live or dead. In each iteration (also called “generation”), the state of the field changes depending on the number of living neighbors: if there are two or three living neighbors next to a living cell, the cell survives in the next generation. If next to a dead cell there are exactly three living neighbors, then in the next generation the cell is “born”. For example:

– cell **A** has one living neighbor, so it dies;

– cell **B** has three living neighbors, so it remains to live;

– cell **C** has three living neighbors, so it is born;

– cell **D**has five living neighbors, so she remains dead.

C # code requires .NET Core 3.1. With the exception of one class (using System.Runtime.Intrinsics.X86), the rest of the code should work in the .NET Framework on earlier versions of .NET Core. Development was conducted in Visual Studio Community Edition 2019.

Implementation in javascript does not require any dependencies, although it is advisable to run the game in the Chrome browser, since it runs extremely slowly in Microsoft Edge. Development was carried out in Visual Studio Code.

### How c # code works

The size of the game field is fixed as 1920×1080 (Full HD), and the extreme cells are always dead. This does not affect performance, but simplifies the algorithms, eliminating checks for boundary conditions. All algorithms are inherited from the abstract LifeJourney class , which allows unifying their testing:**LifeJourney.cs**

```
public abstract class LifeJourney
{
public const int WIDTH = 1920;
public const int HEIGHT = 1080;
public string Name => GetType().Name;
public abstract void Set(int i, int j, bool value);
public abstract bool Get(int i, int j);
public abstract void Step(); // makes one iteration. Performance-critical!
public void Clear()
{
for (int i = 1; i < WIDTH - 1; i++)
for (int j = 1; j < HEIGHT - 1; j++)
Set(i, j, false);
}
public void GenerateRandomField(int seedForRandom, double threshold) { ... }
... // other methods
}
```

Unit tests run the same test suite for each algorithm:**LifeTesting.cs**

```
public class LifeTesting
{
public static void PerformAllTests(LifeJourney life)
{
TestGetSet(life);
TestSetSimpleFigure(life);
TestSimpleFigureAtStart(life);
TestSimpleFigureAtSecondPos(life);
TestSimpleFigure(life);
TestGenerate(life);
TestRandomField(life);
}
...
private static void TestRandomField(LifeJourney life)
{
life.GenerateRandomField(12345, 0.5);
life.Step();
Assert.AreEqual(565797, life.GetLiveCellsCount());
Assert.AreEqual(-717568334, life.GetFingerprint());
}
}
[TestClass]
public class TestDifferentLives
{
[TestMethod]
public void Test_1_SimpleLife() =>
LifeTesting.PerformAllTests(new SimpleLife());
...
}
```

Performance testing creates a field with a random fill using a random number generator with a fixed seed (so each time the field is the same). After that, a cycle of thousands of iterations is launched, and at the end the time is measured (in seconds), and the final state of the field (the number of surviving cells and a hash code) is checked to additionally verify the correct operation of the algorithm:**RunPerformanceTest**

```
public abstract class LifeJourney
{
public void GenerateRandomField(int seed, double threshold)
{
Random rand = new Random(seed);
for (int i = 1; i < WIDTH - 1; i++)
for (int j = 1; j < HEIGHT - 1; j++)
{
Set(i, j, rand.NextDouble() < threshold);
}
}
public void RunPerformanceTest(int steps)
{
GenerateRandomField(12345, 0.5);
Stopwatch timer = new Stopwatch();
timer.Start();
Run(steps);
double elapsedSeconds = timer.Elapsed.TotalSeconds;
Console.WriteLine($"{Name} Performance: {(steps / elapsedSeconds):0.000} steps/sec");
}
... // other methods
}
class Program
{
static void Main(string[] args)
{
int steps = 1000;
new SimpleLife().RunPerformanceTest(steps);
...
}
}
```

### 1. Naive implementation

Let’s start with the simplest implementation: the field is represented by a boolean array that stores information about living cells; Each iteration is performed in two passes:

1) for each cell we count the number of its neighbors and write the future state of the cell in a separate temp array;

2) copy the temp array to the original array.

Two passes are needed because the state of the cell in the next step depends on the state of its neighbors in the current step, so we cannot change the state of the cells until we have calculated the whole next step.**Simplelife**

```
public class SimpleLife : LifeJourney
{
bool[,] field = new bool[WIDTH, HEIGHT];
bool[,] temp = new bool[WIDTH, HEIGHT];
public override bool Get(int i, int j) => field[i, j];
public override void Set(int i, int j, bool value) => field[i, j] = value;
public override void Step()
{
for (int i = 1; i < WIDTH - 1; i++)
{
for (int j = 1; j < HEIGHT - 1; j++)
{ // первый проход: вычисляем будущее состоянее
bool isAlive = field[i, j];
int numNeigbours = 0;
if (field[i - 1, j - 1]) numNeigbours++;
if (field[i - 1, j]) numNeigbours++;
if (field[i - 1, j + 1]) numNeigbours++;
if (field[i, j - 1]) numNeigbours++;
if (field[i, j + 1]) numNeigbours++;
if (field[i + 1, j - 1]) numNeigbours++;
if (field[i + 1, j]) numNeigbours++;
if (field[i + 1, j + 1]) numNeigbours++;
bool keepAlive = isAlive && (numNeigbours == 2 || numNeigbours == 3);
bool makeNewLive = !isAlive && numNeigbours == 3;
temp[i, j] = keepAlive | makeNewLive;
}
}
for (int i = 1; i < WIDTH - 1; i++)
for (int j = 1; j < HEIGHT - 1; j++)
// второй проход: копируем вычисленное состояние в текущее
field[i, j] = temp[i, j];
}
}
}
}
```

The result of work:

```
Алгоритм: SimpleLife
Время работы: 40 секунд
Скорость работы: 25 шагов в секунду
```

Source Code: SimpleLife.cs

### 2. Add bytes

In the previous algorithm, 8 comparison operations are performed for each cell. Let’s try to reduce their number:

– Instead of a two-dimensional boolean array, create a one-dimensional byte array in which the living cell is represented by 1 and the dead cell is 0.

– Instead of eight checks, we will sum the bytes of the neighbors of the current cell, then their sum will be equal to the number of living neighbors.

– In the temporary array, we will write the sum of the cells, postponing the check for the second pass:**Lifebytes**

```
public class LifeBytes : LifeJourney
{
byte[] field = new byte[WIDTH * HEIGHT];
byte[] temp = new byte[WIDTH * HEIGHT];
public override bool Get(int i, int j) => field[j * WIDTH + i] == 1;
public override void Set(int i, int j, bool value) => field[j * WIDTH + i] = (byte)(value ? 1 : 0);
public override void Step()
{
for (int i = 1; i < WIDTH - 1; i++)
for (int j = 1; j < HEIGHT - 1; j++)
{ // считаем число соседей для текущей клетки и кладем в temp[pos]
int pos = j * WIDTH + i;
temp[pos] = (byte)(
field[pos - WIDTH - 1] + field[pos - WIDTH] + field[pos - WIDTH + 1] +
field[pos - 1] + field[pos + 1] +
field[pos + WIDTH - 1] + field[pos + WIDTH] + field[pos + WIDTH + 1]);
}
for (int i = 1; i < WIDTH; i++)
for (int j = 1; j < HEIGHT; j++)
{ // обновляем текущее состояние
// по числу соседей и прошлому состоянию клетки
int pos = j * WIDTH + i;
bool keepAlive = field[pos] == 1 && (temp[pos] == 2 || temp[pos] == 3);
bool makeNewLife = field[pos] == 0 && temp[pos] == 3;
field[pos] = (byte)(makeNewLife | keepAlive ? 1 : 0);
}
}
}
```

We check:

```
Алгоритм: LifeBytes
Время работы: 25 секунд
Скорость работы: 40 шагов в секунду
Ускорение от предыдущей версии: 1.6
Ускорение от первой версии: 1.6
```

Source Code: LifeBytes.cs

### 3. Add eight-byte words

Note that during the first pass, the same eight additions are performed for all bytes, which

suggests that instead of processing one byte, immediately process 8 bytes, since for many instructions working with bytes, modern processors have similar instructions working with two -, in four and eight byte words. In our case, addition of eight-byte words will give the same result as eight additions of neighboring bytes in a row:

Note that this is not always true, since if overflow occurs during arithmetic operations with bytes, then similar operations with long words will give an incorrect result. However, we were lucky, and the maximum value that we can get when adding in a byte is eight, and overflow cannot happen with us.

To replace operations on bytes with operations on eight-byte words, we need to use the pointer arithmetic available in c # :

– mark the class as unsafe;

– get pointers to both arrays;

– convert byte pointer to ulong pointer**like this**

```
fixed (byte* fieldPtr = field, tempPtr = temp)
{
for (int i = WIDTH + 1; i < WIDTH * HEIGHT - WIDTH - 1; i += 8)
{
ulong* ptr = (ulong*)(tempPtr + i);
*ptr += *(ulong*)(fieldPtr + i - WIDTH - 1);
*ptr += *(ulong*)(fieldPtr + i - WIDTH);
*ptr += *(ulong*)(fieldPtr + i - WIDTH + 1);
*ptr += *(ulong*)(fieldPtr + i - 1);
*ptr += *(ulong*)(fieldPtr + i + 1);
*ptr += *(ulong*)(fieldPtr + i + WIDTH - 1);
*ptr += *(ulong*)(fieldPtr + i + WIDTH);
*ptr += *(ulong*)(fieldPtr + i + WIDTH + 1);
}
...
}
```

We measure performance, and …

```
Алгоритм: LongLife
Время работы: 10 секунд
Скорость работы: 100 шагов в секунду
Ускорение от предыдущей версии: 2.5
Ускорение от первой версии: 4
```

Source Code: LongLife.cs

### 4. Add a search in the table

We have optimized the first pass quite well, however, in the second pass, several checks are performed for each byte, which we want to get rid of. The standard optimization in such cases is a lookup table, when the result for all combinations of input data is calculated in advance and put into the table. Our input data is the current state of the cell (one byte in the original array) and the number of neighbors (one byte in the temporary array). We could calculate a table for all combinations of these bytes, which will give us a table size of 16 kilobytes, however, we can reduce the size of the table by simplifying the code at the same time. The idea is that the number of neighbors from 0 to 8 does not occupy a whole byte, but fits in the lower three bits (in fact, four, but 8 neighbors and 0 neighbors give the same effect, so we ignore the fourth bit in the number of neighbors) . Therefore, we will encode the state of the current cell in the fourth bit of the same byte, and our table will be 16 bytes long, of which only three values are equal to one. Values equal to one correspond to indices encoding a dead cell with three neighbors or a living cell with two or three neighbors:

**The code itself is not much more complicated.**

```
public unsafe class LifeIsLookingUp : LifeJourney
{
static byte[] alivePerNeighbours = new byte[256];
byte[] field = new byte[WIDTH * HEIGHT];
byte[] temp = new byte[WIDTH * HEIGHT];
public LifeIsLookingUp()
{
alivePerNeighbours[3] = 1;
alivePerNeighbours[8 + 2] = 1;
alivePerNeighbours[8 + 3] = 1;
}
...
public override void Step()
{
fixed (byte* fieldPtr = field, tempPtr = temp)
{
...
// второй прогон использует таблицу!
for (int i = WIDTH; i < WIDTH * HEIGHT - WIDTH; i++)
{
byte neighbours = (byte)((tempPtr[i] & 7) | (fieldPtr[i] << 3));
fieldPtr[i] = alivePerNeighbours[neighbours];
}
...
}
}
```

As a result, we get the acceleration almost twice:

```
Алгоритм: LifeIsLookingUp
Время работы: 5.5 секунд
Скорость работы: 180 шагов в секунду
Ускорение от предыдущей версии: 1.8
Ускорение от первой версии: 7.2
```

Source Code: LifeIsLookingUp.cs

### 5. Add bit magic

This time, we will forget the previous optimization and optimize the second step again, not only to get rid of conditional expressions, but, as in the first step, to calculate eight cells at a time. We need, in fact, to come up with a bit function that will get the same results as the lookup table from the previous step. Those who have studied Boolean algebra know that for any truth table (and our lookup table is just that) you can write the corresponding Boolean function, although it can turn out to be complicated. However, there is a Carnot diagram method that allows one to obtain a highly optimized Boolean function from the truth table in a simple and clear way. This method is perfectly applicable to our table; however, given that it has only 16 input values, with some ~~luck~~experience you can select the desired function by exhaustive search. We omit the intermediate calculations and give**the resulting code**

```
for (int i = WIDTH; i < WIDTH * HEIGHT - WIDTH; i ++)
{
byte neighbours = tempPtr[i];
byte alive = fieldPtr[i];
neighbours &= (byte)0b00000111;
byte aliveAndNeighbours = ((neighbours & ~(byte)0b00000001) >> 1) | (alive << 2);
aliveAndNeighbours ^= (byte)~0b00000101;
aliveAndNeighbours &= (aliveAndNeighbours >> 2);
aliveAndNeighbours &= (aliveAndNeighbours >> 1);
aliveAndNeighbours &= (byte)0b00000001;
ulong makeNewLife = neighbours | (alive << 3);
makeNewLife ^= (byte)~0b00000011;
makeNewLife &= (makeNewLife >> 2);
makeNewLife &= (makeNewLife >> 1);
makeNewLife &= (byte)0b00000001;
fieldPtr[i] = aliveAndNeighbours | makeNewLife;
}
```

In this form, this code is not much better than searching the table, because it does too many operations per byte. However, it allows you to switch from byte-based computing to processing 8 bytes at a time:**Similar code for ulong instead of bytes**

```
for (int i = WIDTH; i < WIDTH * HEIGHT - WIDTH; i += 8)
{
ulong neighbours = *(ulong*)(tempPtr + i);
ulong alive = *(ulong*)(fieldPtr + i);
neighbours &= 0b00000111_00000111_00000111_00000111_00000111_00000111_00000111_00000111ul;
ulong aliveAndNeighbours = ((neighbours & ~0b00000001_00000001_00000001_00000001_00000001_00000001_00000001_00000001ul) >> 1) | (alive << 2);
aliveAndNeighbours ^= ~0b00000101_00000101_00000101_00000101_00000101_00000101_00000101_00000101ul;
aliveAndNeighbours &= (aliveAndNeighbours >> 2);
aliveAndNeighbours &= (aliveAndNeighbours >> 1);
aliveAndNeighbours &= 0b00000001_00000001_00000001_00000001_00000001_00000001_00000001_00000001ul;
ulong makeNewLife = neighbours | (alive << 3);
makeNewLife ^= ~0b00000011_00000011_00000011_00000011_00000011_00000011_00000011_00000011ul;
makeNewLife &= (makeNewLife >> 2);
makeNewLife &= (makeNewLife >> 1);
makeNewLife &= 0b00000001_00000001_00000001_00000001_00000001_00000001_00000001_00000001ul;
*(ulong*)(fieldPtr + i) = aliveAndNeighbours | makeNewLife;
}
```

The code has become completely incomprehensible, but it works almost twice as fast as the previous version:

```
Алгоритм: LifeInBits
Время работы: 3.3 секунды
Скорость работы: 300 шагов в секунду
Ускорение от предыдущей версии: 1.7
Ускорение от первой версии: 12
```

Source Code: LifeInBits.cs

### 6. We store two cells per byte

The last optimization of this type follows logically from the previous ones: so far we have stored one cell per byte, processing eight bytes at a time, but in all calculations we actually only operated with the four least significant bits in the byte, since the highest four bits are always were zeros. Now we will store two cells per byte: one in the lower four bits, the other in the high bits. A temporary array storing the number of neighbors will also store the number of neighbors independently in the low and high four bits. The benefit is that we will still process eight bytes at a time, but this time there will be not eight, but sixteen cells! Potentially, this can lead to more than twofold acceleration, since, in addition to halving the number of operations, we also halve the size of the used memory,

In the algorithm itself, in the second run, only the iteration constants and boundaries change, but in the first run due to packing two cells per byte, you have to apply logical shifts of 4 and 60 bytes left and right, which makes the code more confusing:**Two cells in a byte**

But the implementation of the get and set methods is complicated. Fortunately, these methods have almost no effect on performance, so you don’t have to worry too much about it. Also, pay attention to the technical trick: for the algorithm to work correctly, we had to add one empty line to the beginning and the end of the array in order to avoid checks for going beyond the boundaries of the array.**get and set**

Here is the result:

Source Code: LifeIsABitMagic.cs

### 7. Using processor SIMD instructions

Each subsequent optimization is given to us with increasing difficulty, so this time we will go the other way and use perhaps the most powerful optimization method available at the moment: SIMD instructions , or rather, the AVX2 instruction set available on most modern Intel processors and AMD, and allowing operations on 32 bytes at a time! This method is not always applicable, but our case (working with bytes and lack of branching) ideally falls on the ideology of SIMD, so we can expect a significant improvement in performance. A few years ago, it was only available in low-level languages. This would not stop us: for the sake of optimization, we would compile the library in C ++ and use it from C # through the P / Invoke mechanism. However, literally a year ago, in .NET Core, built-in support for this set of instructions appeared, there was even an article on Habr . So now we can use this “heavy artillery” directly in C #! We take the penultimate version of the LifeInBits algorithm, which stores one cell per byte (for reasons beyond the scope of this article), and replace the manual manipulation of bytes, bits and eight-byte words with the corresponding AVX2 instructions. At the same time, oddly enough, the code has even become simpler and more understandable!**simple and clear code**

This approach gives us the greatest optimization (a hundred times compared to the original algorithm, so the title of the article is true)!

Source Code: AdvancedLifeExtensions.cs

### It’s all?

For completeness, we should mention two more optimizations that we did not touch on:

1) Transfer of calculations to the video card – since the algorithm fits well with the SIMD logic, acceleration can be expected by another order of magnitude. It is on it that the researchers of the game “Life” (yes, there are such!) Search for new interesting figures by brute force. Unfortunately, it is still impossible to write code for processing data on a video card in .NET.

2) The complexity of all the considered algorithms is linear in time and field area. There is a Hashlife algorithm , in some situations having a **logarithmic**complexity in time and area! It is this algorithm that is used to simulate huge systems for millions and billions of iterations. Nevertheless, there are poor initial conditions under which this algorithm will be much slower than the classical ones, at least at the initial stage.

### Javascript

Finally, we got to the most interesting: can these optimizations be transferred to javascript?

Unfortunately, it is impossible to use the SIMD version, since the AVX2 has not yet been brought there. However, surprisingly, the penultimate version with bit magic and packing two cells in one byte can be used, though not an eight-byte option, but only a four-byte one.

The trick here is that in javascript there are special classes that allow you to work with an array of bytes both as bytes and as numbers: UInt8Array and UInt32Array. At the same time, you can transfer the code almost verbatim (adjusted for processing four bytes instead of eight) and hope that the JIT compiler of the javascript engine will be able to process our bit arithmetic without converting to floating point numbers.**Javascript Code**

Here is the result:

The full code in the lifefield.js file. Finally,

a summary table with all the algorithms:

### Online game

Finally, the most interesting: based on this algorithm, an online game was written that stably worked quickly on a field of size 1920×1080 (Full HD) with any initial configuration:

play the simulation of the game “Life”

The source codes are published on the github.

## Leave a Reply