Conway's Game of Life on GPU using CUDA

Posted 1 Mar 2013

By Marek Fiser 37 min read

Chapter 6: Conclusion

This project presented three different conventional evaluation algorithms for Conway's Game of Life: basic algorithm, bit counting algorithm, and lookup table algorithm. All three algorithms were carefully implemented on both CPU and GPU and thorough benchmark was performed.

The best algorithm for the CPU turned out to be the lookup table that yields speedups up to 45× over the serial CPU. The best algorithm for the GPG was the bit counting with speedup of 480× over the serial CPU. Following section shows detailed graphs and comparison of all algorithms.

Hardware

CPU: Xeon E5530 @ 2.40GHz with 4 cores (8 threads)

GPU: GeForce GTX 580 with 1.5 GB of graphics memory and 512 CUDA cores

Overall performance evaluation

Figure 26 presents absolute evaluation speeds in millions of evaluated cells per second. The GPU is able to evaluate amazing 28 trillion cells per second. This number is hard to imagine. I like to compare fast things with fast things, say with the speed of light – nearly 300,000 km/s, that's 1,000,000,000 km/h (671,000,000 mph).

How far can light travel until one life cells is evaluated (on average)?

The distance per evaluated cell dpc can be computed as:

(1)dpc = c/es = 299,792,458/24,688,686,207 = 0.01214 m, where

dpc is distance per evaluated cell in meters,
c is speed of light in m/s, and
es is evaluation speed in cells/s.

One cell is evaluated for every 0.01214 meters passed by light! That's 12 mm or 0.48 in! That is quite mind-blowing. Let me know if you have any conventional (no hash-life) algorithm that performs significantly better on comparable hardware.

Figure 26: Evaluation speed of all described algorithms with their best settings.

Last two charts presents speedup over the serial CPU (Figure 27) and over the parallel CPU (Figure 28). The GPU is reaching speedup of 480× over the serial CPU or 235× over the parallel CPU algorithms. I consider those speedups as satisfying results.

Figure 27: Speedup of all described implementations over basic serial CPU implementation.

Figure 28: Speedup of all described implementations over basic parallel CPU implementation.

Code and supplementary material

You can find full source code published under Public Domain on the GitHub.

projects

This post is licensed under CC BY 4.0 by the author.