
Conway's Game of Life on GPU using CUDA

Chapter 2: Basic implementation

In theory, Conway's Game of Life is pretty simple and straightforward to implement. However, one has to be careful with implementation details to achieve good performance.

The first optimization is to get rid of as many if statements as possible. If statements are no friends of the CPU, and even less so of the GPU. On the GPU, an if statement may cause warp divergence, which slows down execution.

Counting the neighbors of a cell can be done with eight if statements, but those ifs can be avoided completely. An integer type such as char (I actually used unsigned char, aliased as ubyte) can represent the state of a cell as the value 1 or 0 (alive or dead). This slight change of representation turns the neighbor-counting code from eight if statements into seven additions (pseudo-code shown in Code listing 3; the avoided if-based version is sketched right after the listing for contrast).

Code listing 3: Pseudo-code showing neighbor counting without conditionals, thanks to the 0/1 world representation.
uint aliveCells = isCellAlive[x-1][y-1] + isCellAlive[x][y-1] + isCellAlive[x+1][y-1]
	+ isCellAlive[x-1][y] + isCellAlive[x+1][y]
	+ isCellAlive[x-1][y+1] + isCellAlive[x][y+1] + isCellAlive[x+1][y+1];
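
For contrast, the if-based version that this representation avoids would look something like the sketch below (mine, not from the original post). On the GPU, each of these branches is an opportunity for warp divergence.

uint aliveCells = 0;
if (isCellAlive[x-1][y-1]) { ++aliveCells; }
if (isCellAlive[x  ][y-1]) { ++aliveCells; }
if (isCellAlive[x+1][y-1]) { ++aliveCells; }
if (isCellAlive[x-1][y  ]) { ++aliveCells; }
if (isCellAlive[x+1][y  ]) { ++aliveCells; }
if (isCellAlive[x-1][y+1]) { ++aliveCells; }
if (isCellAlive[x  ][y+1]) { ++aliveCells; }
if (isCellAlive[x+1][y+1]) { ++aliveCells; }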

The second optimization is to use a single flat array instead of an array of arrays. This makes the code less readable but more efficient. Every cell access changes from isCellAlive[x][y] to isCellAlive[y * width + x].

Code listing 4: Actual code for counting the number of alive neighbors.
// Y-coordinates y0, y1 and y2 are already pre-multiplied by the world width.
inline ubyte countAliveCells(size_t x0, size_t x1, size_t x2,
		size_t y0, size_t y1, size_t y2) {
	return m_data[x0 + y0] + m_data[x1 + y0] + m_data[x2 + y0]
		+ m_data[x0 + y1] + m_data[x2 + y1]
		+ m_data[x0 + y2] + m_data[x1 + y2] + m_data[x2 + y2];
}

Serial CPU implementation

The actual serial implementation of Conway's Game of Life on the CPU is shown in Code listing 5. The method countAliveCells was already shown in Code listing 4.

The method computeIterationSerial computes one iteration of Conway's Game of Life on the array m_data, using m_resultData as a helper array. Two arrays are necessary to avoid computation errors caused by overwriting cell states that are still needed as input. The source and result arrays are swapped at the end, so the result always ends up in m_data.

Code listing 5: Serial CPU implementation of Conway's Game of Life.
typedef unsigned char ubyte;

ubyte* m_data;
ubyte* m_resultData;

size_t m_worldWidth;
size_t m_worldHeight;
size_t m_dataLength;  // m_worldWidth * m_worldHeight

void computeIterationSerial() {
	for (size_t y = 0; y < m_worldHeight; ++y) {
		// Pre-multiplied row offsets of the rows above, at, and below y,
		// wrapped toroidally at the world edges.
		size_t y0 = ((y + m_worldHeight - 1) % m_worldHeight) * m_worldWidth;
		size_t y1 = y * m_worldWidth;
		size_t y2 = ((y + 1) % m_worldHeight) * m_worldWidth;

		for (size_t x = 0; x < m_worldWidth; ++x) {
			size_t x0 = (x + m_worldWidth - 1) % m_worldWidth;
			size_t x2 = (x + 1) % m_worldWidth;

			ubyte aliveCells = countAliveCells(x0, x, x2, y0, y1, y2);
			m_resultData[y1 + x] =
				aliveCells == 3 || (aliveCells == 2 && m_data[x + y1]) ? 1 : 0;
		}
	}
	std::swap(m_data, m_resultData);
}

Parallel CPU implementation

Conway's Game of Life can be parallelized very easily, since every cell needs only information about its neighbors. The parallel code in Code listing 6 is nearly identical to the serial code. The only difference is the replacement of the serial for loop with a parallel one.

Concurrency::parallel_for comes from Microsoft's Parallel Patterns Library and can be found in the header ppl.h. This library makes parallel programming much easier and closer to C# (the language I know best).

Just in case anybody is wondering how Concurrency::parallel_for works: it takes a start index, an end index, an increment, and a reference to a function (or lambda) that is called for every index in the range, in parallel. It automatically uses all available hardware cores on the current machine. On a machine with a single single-core processor and no hyper-threading, the code effectively runs serially.
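
As a minimal illustration (this toy example is mine, not from the original post), a call could look like the following; the output order is nondeterministic because the iterations run in parallel.

#include <ppl.h>
#include <cstdio>

int main() {
	// Calls the lambda for indices 0, 1, ..., 7 with step 1, in parallel.
	Concurrency::parallel_for<size_t>(0, 8, 1, [](size_t i) {
		std::printf("%zu squared is %zu\n", i, i * i);
	});
	return 0;
}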

You may be wondering whether using a static function could be faster than constructing the lambda on every invocation. I was wondering too. The measurements shown in Figure 4 in the next section reveal no significant difference between the two approaches. Actually, the lambda was sometimes a bit faster, but that was probably just noise.
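
For completeness, here is a rough sketch of what the static-function variant might look like; the post does not show this code, and the name evaluateCellStatic is mine. The body mirrors the lambda from Code listing 6 and reads the same file-scope state.

// Hypothetical free-function counterpart of the lambda in Code listing 6.
static void evaluateCellStatic(size_t index) {
	size_t x1 = index % m_worldWidth;
	size_t y1 = index - x1;  // Row offset of the cell: y * m_worldWidth.
	size_t y0 = (y1 + m_dataLength - m_worldWidth) % m_dataLength;
	size_t y2 = (y1 + m_worldWidth) % m_dataLength;
	size_t x0 = (x1 + m_worldWidth - 1) % m_worldWidth;
	size_t x2 = (x1 + 1) % m_worldWidth;

	ubyte aliveCells = countAliveCells(x0, x1, x2, y0, y1, y2);
	m_resultData[y1 + x1] =
		aliveCells == 3 || (aliveCells == 2 && m_data[x1 + y1]) ? 1 : 0;
}

// parallel_for accepts any callable, so a plain function works as well.
Concurrency::parallel_for<size_t>(0, m_dataLength, 1, evaluateCellStatic);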

Code listing 6: Parallel CPU implementation of Conway's Game of Life.
typedef unsigned char ubyte;

ubyte* m_data;
ubyte* m_resultData;

size_t m_worldWidth;
size_t m_worldHeight;
size_t m_dataLength;  // m_worldWidth * m_worldHeight

void computeIterationParallel() {
	auto evaluateCell = [&] (size_t index) {
		size_t x1 = index % m_worldWidth;
		size_t y1 = index - x1;  // Row offset of the cell: y * m_worldWidth.
		size_t y0 = (y1 + m_dataLength - m_worldWidth) % m_dataLength;
		size_t y2 = (y1 + m_worldWidth) % m_dataLength;
		size_t x0 = (x1 + m_worldWidth - 1) % m_worldWidth;
		size_t x2 = (x1 + 1) % m_worldWidth;

		ubyte aliveCells = countAliveCells(x0, x1, x2, y0, y1, y2);
		m_resultData[y1 + x1] =
			aliveCells == 3 || (aliveCells == 2 && m_data[x1 + y1]) ? 1 : 0;
	};

	Concurrency::parallel_for<size_t>(0, m_worldWidth * m_worldHeight, 1, evaluateCell);
	std::swap(m_data, m_resultData);
}

CPU performance evaluation

Just out of curiosity, I compared the performance of the serial and parallel CPU implementations. My CPU is an Intel Xeon E5530 @ 2.40 GHz with 4 cores (8 threads). I was hoping that the parallel implementation would be more than 4 times faster, since the machine has 4 physical cores and 8 hardware threads.

The performance was measured on worlds containing from 2^13 (about 8 thousand) to 2^32 (about 4 billion) cells. The reported values are medians of five independent measurements; in each measurement, 16 iterations of Life were performed and the total time was divided by 16 to get the average time per iteration. The results of the CPU benchmark are shown in Figure 4; as you can see, the parallel implementation is roughly two times faster.
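
The measurement loop itself is not shown in the post; a minimal sketch of the described methodology, assuming std::chrono for timing (the helper names are mine), could look like this:

#include <algorithm>
#include <chrono>
#include <vector>

// Hypothetical harness matching the methodology above: one measurement runs
// 16 iterations and reports the average time per iteration in milliseconds.
// computeIterationSerial is the method from Code listing 5.
double measureIterationTimeMs() {
	auto start = std::chrono::high_resolution_clock::now();
	for (int i = 0; i < 16; ++i) {
		computeIterationSerial();
	}
	auto end = std::chrono::high_resolution_clock::now();
	return std::chrono::duration<double, std::milli>(end - start).count() / 16;
}

// Median of five independent measurements.
double medianIterationTimeMs() {
	std::vector<double> times;
	for (int run = 0; run < 5; ++run) {
		times.push_back(measureIterationTimeMs());
	}
	std::sort(times.begin(), times.end());
	return times[2];
}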

I was quite surprised that the speedup was less than three times. The low speedup is probably caused by limited memory throughput, since the code does a lot of memory I/O and relatively few arithmetic operations.

You can also see slightly different speedups for the static function versus the lambda function. I am not sure what causes the difference; however, for larger worlds, the static function is slightly faster.

Figure 4: Evaluation speed of serial and parallel CPU implementations.

Simple GPU implementation

All GPU implementations described in this article use CUDA. It is not the only technology for general-purpose computing on GPUs; others include OpenCL and DirectCompute. However, this project was made for a CUDA class, so the choice of technology was easy.

The simple GPU implementation is very similar to the parallel CPU implementation. The only significant difference is an added loop in the CUDA kernel, a so-called grid-stride loop. This loop ensures that if the kernel is invoked with fewer threads than there are cells, every thread loops over several cells so all of them get computed. Since the number of cells can easily reach billions, this approach is necessary. Code listing 7 shows the CUDA kernel.

The code that allocates memory and copies data from the CPU to the GPU is not shown here because there is nothing special about it, just cudaMalloc, cudaMemcpy and cudaFree. Well, actually, cudaMalloc and cudaFree are called only when the size of the world changes. This avoids unnecessary allocation, deallocation and copying of memory.
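
A rough sketch of that (not shown) buffer management might look as follows; ensureGpuBuffers and allocatedLength are illustrative names of mine, not from the post.

// Hypothetical sketch: (re)allocate the two GPU buffers only when the world
// size changes, as described above.
void ensureGpuBuffers(ubyte*& d_lifeData, ubyte*& d_lifeDataBuffer,
		size_t dataLength, size_t& allocatedLength) {
	if (dataLength == allocatedLength) {
		return;  // Buffers already have the right size.
	}
	cudaFree(d_lifeData);  // cudaFree of a null pointer is a safe no-op.
	cudaFree(d_lifeDataBuffer);
	cudaMalloc((void**)&d_lifeData, dataLength * sizeof(ubyte));
	cudaMalloc((void**)&d_lifeDataBuffer, dataLength * sizeof(ubyte));
	allocatedLength = dataLength;
}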

Code listing 7: Basic GPU implementation of Conway's Game of Life using CUDA.
__global__ void simpleLifeKernel(const ubyte* lifeData, uint worldWidth,
		uint worldHeight, ubyte* resultLifeData) {
	uint worldSize = worldWidth * worldHeight;

	// Grid-stride loop: when the grid has fewer threads than there are cells,
	// each thread advances by the total thread count and evaluates several cells.
	for (uint cellId = __mul24(blockIdx.x, blockDim.x) + threadIdx.x;
			cellId < worldSize;
			cellId += blockDim.x * gridDim.x) {
		uint x = cellId % worldWidth;
		uint yAbs = cellId - x;
		uint xLeft = (x + worldWidth - 1) % worldWidth;
		uint xRight = (x + 1) % worldWidth;
		uint yAbsUp = (yAbs + worldSize - worldWidth) % worldSize;
		uint yAbsDown = (yAbs + worldWidth) % worldSize;

		uint aliveCells = lifeData[xLeft + yAbsUp] + lifeData[x + yAbsUp]
			+ lifeData[xRight + yAbsUp] + lifeData[xLeft + yAbs] + lifeData[xRight + yAbs]
			+ lifeData[xLeft + yAbsDown] + lifeData[x + yAbsDown] + lifeData[xRight + yAbsDown];

		resultLifeData[x + yAbs] =
			aliveCells == 3 || (aliveCells == 2 && lifeData[x + yAbs]) ? 1 : 0;
	}
}

Code listing 8 shows the code for the kernel invocation. Notice the same std::swap trick that is used in the CPU implementation: two buffers are allocated on the GPU and swapped between iterations. This means that no data is copied between the CPU and the GPU between consecutive iterations.

The number of threads per block is a parameter because I will run experiments with different settings to see which configuration performs best.

Code listing 8: Invocation of the CUDA kernel of the basic GPU implementation.
void runSimpleLifeKernel(ubyte*& d_lifeData, ubyte*& d_lifeDataBuffer, size_t worldWidth,
		size_t worldHeight, size_t iterationsCount, ushort threadsCount) {
	assert((worldWidth * worldHeight) % threadsCount == 0);
	size_t reqBlocksCount = (worldWidth * worldHeight) / threadsCount;
	ushort blocksCount = (ushort)std::min((size_t)32768, reqBlocksCount);

	for (size_t i = 0; i < iterationsCount; ++i) {
		simpleLifeKernel<<<blocksCount, threadsCount>>>(d_lifeData, worldWidth,
			worldHeight, d_lifeDataBuffer);
		std::swap(d_lifeData, d_lifeDataBuffer);
	}
}
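
As a usage example (the sizes are illustrative, not from the post), evaluating a 2048 x 2048 world for 100 iterations with 128 threads per block could look like this:

// d_lifeData and d_lifeDataBuffer are device pointers allocated elsewhere.
runSimpleLifeKernel(d_lifeData, d_lifeDataBuffer, 2048, 2048, 100, 128);
cudaDeviceSynchronize();  // Wait until all queued kernel launches finish.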

GPU performance evaluation

The CUDA kernel can be invoked with different numbers of threads per block (tpb). Different values result in significantly different performance. The performance analysis shown in Figure 5 was run to determine the best value. Reported times are medians of 15 runs. With 32 threads per block, the GPU is not saturated and the performance is not the best. All other settings are quite comparable.

The GPU used for all benchmarks is a GeForce GTX 580 with 1.5 GB of graphics memory and 512 CUDA cores. A comparison with the CPU follows in the next section.

Figure 5: Evaluation speed of basic GPU implementation with different threads per block (tpb) settings.

CPU vs. GPU

The final performance comparison of the basic Life evaluation algorithm is shown in Figure 6. The comparison is presented as a speedup over the serial CPU implementation, because the serial CPU is very stable at about 50 million evaluated cells per second.

As you can see in Figure 6, the GPU is significantly faster than the CPU. The speedup of the GPU algorithm over the serial CPU algorithm reaches 143x, compared to the parallel CPU's speedup of just about 2x.

The evaluation of the cells is very heavy on memory I/O and light on arithmetic operations. The superiority of the GPU comes from its very fast memory chips, which have much higher throughput than the CPU's RAM. Note that the reported GPU speedup is over the serial CPU implementation; the speedup over the parallel CPU implementation is about 70x.
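
To put the 143x figure in absolute terms: at roughly 50 million evaluated cells per second for the serial CPU, the GPU evaluates about 143 × 50 million ≈ 7 billion cells per second.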

The next chapter describes an improved algorithm that reaches even higher speedups for both the CPU and GPU implementations.

Figure 6: Speedup of parallel CPU implementation and GPU implementation with 128 threads per block (tpb) over serial CPU implementation.