
Now CPU rendering at 453 FPS

An old trick still useful today
Added another planned optimization to the Sandbox example. Using dirty rectangles, I can avoid redrawing the height, diffuse and normal background where nothing has moved. I worried that the frame rate would feel uneven, because this widens the gap between the worst and the average case, but at these speeds a quarter of a frame of jitter is lost to tearing against the screen's 144Hz refresh rate anyway.
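To make the dirty rectangle idea concrete, here is a minimal sketch, assuming a single 32-bit color buffer and a pre-rendered background of the same size. The names Rect, Image, clipToImage and restoreBackground are made up for illustration and are not part of the DFPSR API. The Sandbox example would do the same for its height, diffuse and normal buffers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical types for illustration; not the DFPSR API.
struct Rect { int left, top, right, bottom; }; // right and bottom are exclusive

struct Image {
    int width, height;
    std::vector<uint32_t> pixels;
    Image(int w, int h, uint32_t fill = 0)
    : width(w), height(h), pixels(size_t(w) * size_t(h), fill) {}
};

// Clip a rectangle against the image bounds.
Rect clipToImage(const Rect &r, const Image &image) {
    return Rect{ std::max(r.left, 0), std::max(r.top, 0),
                 std::min(r.right, image.width), std::min(r.bottom, image.height) };
}

// Restore only the regions that moving sprites drew over in the previous frame.
// Returns the number of pixels copied, which is the work saved compared to width * height.
size_t restoreBackground(Image &target, const Image &background, const std::vector<Rect> &dirty) {
    assert(target.width == background.width && target.height == background.height);
    size_t copied = 0;
    for (const Rect &d : dirty) {
        Rect r = clipToImage(d, target);
        if (r.left >= r.right || r.top >= r.bottom) continue; // Entirely outside of the image
        for (int y = r.top; y < r.bottom; y++) {
            const uint32_t *source = &background.pixels[size_t(y) * background.width + r.left];
            std::copy(source, source + (r.right - r.left),
                      &target.pixels[size_t(y) * target.width + r.left]);
            copied += size_t(r.right - r.left);
        }
    }
    return copied;
}
```

Each frame, the rectangles covered by sprites in the previous frame are restored from the background before the sprites are drawn at their new positions. The worst case (everything moving) costs as much as a full redraw, which is where the worry about uneven frame times comes from.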

Results
Reached 453 frames per second instead of 295 at 800x600 pixels. With 185 FPS at 1920x1080 pixels, I'm pushing the limits of my 144Hz gaming monitor and its extra fast DisplayPort cable.



Might sound like overkill
but getting people to use CPU rendering again requires some sick performance to be convincing, because performance is the GPU's only advantage. More optimization can be done by computing deferred light using a smaller buffer, one block at a time. The current implementation is naive and thrashes the cache. These blocks would also allow skipping light sources based on early light culling tests.
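A rough sketch of what block-wise lighting with early culling could look like, assuming simple screen-space point lights with a quadratic falloff. The Light struct, lightTouchesBlock and lightInBlocks are hypothetical names, and the real renderer works on height, diffuse and normal buffers rather than a single float per pixel.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical screen-space point light; not a DFPSR type.
struct Light { float x, y, radius, intensity; };

// True if the light's bounding circle can reach any part of the block.
bool lightTouchesBlock(const Light &light, int left, int top, int right, int bottom) {
    // Closest point inside the block to the center of the light
    float cx = std::max(float(left), std::min(light.x, float(right)));
    float cy = std::max(float(top), std::min(light.y, float(bottom)));
    float dx = cx - light.x, dy = cy - light.y;
    return dx * dx + dy * dy < light.radius * light.radius;
}

// Lights the image one block at a time using a small buffer that stays in cache.
// Returns how many (block, light) pairs were shaded after culling.
size_t lightInBlocks(std::vector<float> &output, int width, int height, int blockSize,
                     const std::vector<Light> &lights) {
    assert(blockSize > 0 && output.size() == size_t(width) * size_t(height));
    std::vector<float> block(size_t(blockSize) * size_t(blockSize));
    size_t shadedPairs = 0;
    for (int top = 0; top < height; top += blockSize) {
        for (int left = 0; left < width; left += blockSize) {
            int right = std::min(left + blockSize, width);
            int bottom = std::min(top + blockSize, height);
            std::fill(block.begin(), block.end(), 0.0f);
            for (const Light &light : lights) {
                // Early culling: skip lights that cannot reach this block
                if (!lightTouchesBlock(light, left, top, right, bottom)) continue;
                shadedPairs++;
                for (int y = top; y < bottom; y++) {
                    for (int x = left; x < right; x++) {
                        float dx = x + 0.5f - light.x, dy = y + 0.5f - light.y;
                        float falloff = 1.0f - (dx * dx + dy * dy) / (light.radius * light.radius);
                        if (falloff > 0.0f) {
                            block[size_t(y - top) * blockSize + (x - left)] += light.intensity * falloff;
                        }
                    }
                }
            }
            // Write the finished block to the full-size image once
            for (int y = top; y < bottom; y++) {
                const float *source = &block[size_t(y - top) * blockSize];
                std::copy(source, source + (right - left), &output[size_t(y) * width + left]);
            }
        }
    }
    return shadedPairs;
}
```

The point of the small buffer is that all lights accumulate into memory that stays hot, instead of every light making a full pass over the whole image.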

What to do next
Ray-tracing cards use parallelism to achieve soft shadows by letting each pixel sample a different light direction. With a CPU's higher frequency, one can go for higher speeds instead, letting light sources shake faster than the eye can see and then smoothing out the result using temporal motion blur. A sliding window integral implementation would allow creating stable distance-adaptive shadows using a pre-defined set of light offsets. As long as the light being subtracted has the same relative offset as the new light being added, the result should look stable without the need for noise reduction filters. An exponentially fading temporal blur would be cheaper, but the perceived smoothness is relative to the most visible frame.
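The difference between the two temporal filters can be sketched for a single pixel, assuming one light sample per frame and a window as long as the cycle of light offsets. SlidingWindowSum and ExponentialBlur are illustrative names only, and nothing like this exists in the renderer yet.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sliding window integral over the last N light samples of one pixel.
class SlidingWindowSum {
public:
    explicit SlidingWindowSum(size_t windowSize) : samples(windowSize, 0.0f) {
        assert(windowSize > 0);
    }
    // Adds the new sample, subtracts the one leaving the window and returns the average.
    float push(float sample) {
        sum += sample - samples[next];
        samples[next] = sample;
        next = (next + 1) % samples.size();
        return sum / float(samples.size());
    }
private:
    std::vector<float> samples; // Ring buffer of the samples currently in the window
    float sum = 0.0f;           // Running sum of the ring buffer
    size_t next = 0;            // Index of the oldest sample, replaced on the next push
};

// Exponentially fading blur: no history buffer, but the newest frame always weighs the most.
struct ExponentialBlur {
    float value = 0.0f;
    float push(float sample, float weight) {
        value += (sample - value) * weight;
        return value;
    }
};
```

With four light offsets where one is occluded, the samples for a static pixel repeat as 1, 0, 1, 1. Once the window is full, every new sample replaces an old sample taken with the same offset, so the sliding window stays at 0.75, while the exponential blur keeps oscillating around that level. A running sum of arbitrary floats will drift over time, so a real implementation would likely use integers or recompute the sum now and then.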

Screen-space ambient occlusion can be stored relative to the world using the same technique as the background. By only updating ambient light around places where something moved, the cost will be close to nothing for soft shadows.

Once I'm done optimizing, this will be enough to challenge Nvidia's latest RTX cards in terms of resolution, frame-rate and visual quality.
Mārtiņš Možeiko, Edited by Mārtiņš Možeiko on
> Once I'm done optimizing, this will be enough to challenge Nvidia's latest RTX cards in terms of resolution, frame-rate and visual quality.

The difference with "Nvidia's latest RTX cards" is that they will get the same or better performance while leaving the CPU free for whatever else the game logic needs to do. With software rendering you are hogging all of the CPU and not leaving much time for advanced physics or anything else. Sure, it can reach the same or similar graphics quality, but not the same game experience. There are also battery savings on laptops. So "challenge" here applies to a very restricted use case.
Sure, but you also won't have to buy a dedicated GPU to begin with and can spend the money on more CPU cores.
Edited by Dawoodoz on
64-core ARM CPUs exist for servers, so a version optimized for games would give good enough power efficiency. This renderer is already optimized for ARM NEON in case it becomes mainstream in the future.

https://www.top500.org/news/chine...ils-speedy-64-core-arm-processor/
Mārtiņš Možeiko, Edited by Mārtiņš Možeiko on
That is a pretty weak CPU. ARM CPUs are made for power efficiency, not performance. Intel's 10th generation Comet Lake i9-10900K can do 460 Gflops with just 10 cores, while this ARM CPU claims 512 Gflops with 64 cores. That is 46 versus 8 Gflops per core, so almost 6x less performance per core. And these are just flops numbers, not taking into account other advantages of the x86 architecture over ARM. Also, I'm pretty sure this ARM CPU costs way more than the latest Intel CPU. Take any dual-socket server motherboard with two latest-gen Intel CPUs, and you will easily beat any ARM server chip by a huge margin. Not 10x, but something like 3x is reasonable to expect.

Neon is already mainstream. Half of the world, if not more, is carrying a Neon-capable device in their pocket: a smartphone. All non-deprecated iPhones have it, and almost all Android devices have it too (except some very low-end ones).

Here's a better ARM CPU server: https://www.anandtech.com/show/15...r-two-a64fx-nodes-in-a-2u-for-40k
Two 48-core CPUs, 2764 Gflops in total, but it costs $40K. Homework: calculate what performance you can get for this price using Intel CPUs.
This is huge!

I downloaded and played with the lib a little and the results are really impressive.