Modern OpenGL features and GPU bottleneck

Hello.
I am working on a real-time GPU ray tracer using OpenGL.

I am targeting OpenGL 4+ GPUs (including Nvidia Fermi, which doesn’t support bindless graphics).
My engine is always GPU-bottlenecked. It uses plain uniforms (no uniform blocks), glDrawElements, and traditional texture bindings.
Hardware: Intel Core i7 950 + Nvidia GeForce GTX 690.
Frame times: CPU 3-4 ms, GPU 15-25 ms (global illumination).

I plan to add support for uniform buffers, MultiDrawIndirect, and texture arrays to reduce GPU overhead.
Can these features actually improve GPU performance, or are they just CPU optimizations that would move some work onto the GPU and potentially reduce performance?
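For context, the uniform-buffer switch I have in mind looks roughly like this. It’s a minimal sketch: the FrameParams block and its layout are placeholders, and it assumes GLEW (or another loader) with a GL 4.x context already current.

```cpp
// Minimal sketch of the uniform-buffer path; FrameParams is a placeholder.
#include <GL/glew.h>

// Mirrors a std140 block:  layout(std140, binding = 0) uniform FrameParams { ... };
// (the binding layout qualifier needs GL 4.2; on 4.0/4.1 use glUniformBlockBinding)
struct FrameParams {
    float viewProj[16];   // mat4
    float cameraPos[4];   // vec4 (a vec3 is padded to 16 bytes under std140)
};

GLuint CreateFrameUBO() {
    GLuint ubo = 0;
    glGenBuffers(1, &ubo);
    glBindBuffer(GL_UNIFORM_BUFFER, ubo);
    glBufferData(GL_UNIFORM_BUFFER, sizeof(FrameParams), nullptr, GL_DYNAMIC_DRAW);
    glBindBufferBase(GL_UNIFORM_BUFFER, 0, ubo);   // attach to binding point 0
    return ubo;
}

// One glBufferSubData per frame replaces many individual glUniform* calls.
void UpdateFrameUBO(GLuint ubo, const FrameParams& params) {
    glBindBuffer(GL_UNIFORM_BUFFER, ubo);
    glBufferSubData(GL_UNIFORM_BUFFER, 0, sizeof(params), &params);
}
```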

Wait. NV bindless *is* supported on Fermi cards (GL_NV_shader_buffer_load and GL_NV_vertex_buffer_unified_memory). In fact, it’s even supported on pre-Fermi cards, all the way back to at least the GeForce 8 series.

Before you jump to techniques, it sounds like you first need to do a bottleneck analysis. First, what is your performance goal? Once you reach it, stop! Next, what part of your processing is consuming the largest share of frame time? Are you compute bound? Are you memory bound? Are your threads highly divergent? And specifically, what about your workload is making that the bottleneck?

Once you have that, you can look at techniques to optimize that bottleneck.

Since you said ray tracing, I’ll venture a guess. Past the primary rays, full ray tracing is (generally speaking) very divergent, and GPUs aren’t good at that. So my first guess is that you may have a big problem with thread divergence. Also, general ray tracing involves lots of spatial queries, which are very memory-bandwidth intensive. GPUs hide memory latency by keeping many threads in flight and swapping other threads in while some are waiting on memory accesses. However, if all your threads are waiting on memory reads… well, you get the idea. So you could also be memory bound.
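If divergence does turn out to be the problem, one common mitigation (not something from your engine, just a well-known technique) is to sort or bin rays so that neighboring threads trace similar paths. Here’s a toy CPU-side sketch of the idea; all names are invented, and a real renderer would do this per bounce, ideally on the GPU:

```cpp
// Toy illustration of ray binning: group rays by the sign pattern of their
// direction so that neighboring GPU threads traverse similar BVH paths.
#include <algorithm>
#include <cstdio>
#include <vector>

struct Ray {
    float ox, oy, oz;   // origin
    float dx, dy, dz;   // direction
};

// 8 bins, one per direction octant.
static int Octant(const Ray& r) {
    return (r.dx < 0.0f ? 1 : 0) | (r.dy < 0.0f ? 2 : 0) | (r.dz < 0.0f ? 4 : 0);
}

int main() {
    std::vector<Ray> rays = {
        {0, 0, 0,  1.0f,  0.5f,  0.2f},
        {0, 0, 0, -0.3f,  0.8f,  0.1f},
        {0, 0, 0,  0.6f, -0.7f,  0.4f},
        {0, 0, 0,  0.2f,  0.9f, -0.5f},
    };

    // Coherent rays end up adjacent, so a warp is more likely to agree on
    // its traversal path instead of diverging.
    std::stable_sort(rays.begin(), rays.end(),
                     [](const Ray& a, const Ray& b) { return Octant(a) < Octant(b); });

    for (const Ray& r : rays)
        std::printf("octant %d: dir (%g, %g, %g)\n", Octant(r), r.dx, r.dy, r.dz);
    return 0;
}
```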

So anyway, I’d suggest you profile first (have you tried Nsight?). Then, once you know the largest bottleneck, figure out what you can do about it.
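Even without Nsight, GL timer queries will give you coarse per-pass GPU timings. A minimal sketch, assuming a current GL context (GL 3.3+ / ARB_timer_query) and GLEW for function loading:

```cpp
// Minimal GL timer-query sketch for coarse per-pass GPU timing.
#include <GL/glew.h>
#include <cstdio>

GLuint gTimeQuery = 0;

void InitTimer() {
    glGenQueries(1, &gTimeQuery);
}

// Wrap a render pass (e.g. the GI pass) and print how long the GPU took.
void TimePass(void (*renderPass)()) {
    glBeginQuery(GL_TIME_ELAPSED, gTimeQuery);
    renderPass();
    glEndQuery(GL_TIME_ELAPSED);

    // Blocks until the GPU finishes; fine for profiling, not for shipping.
    GLuint64 elapsedNs = 0;
    glGetQueryObjectui64v(gTimeQuery, GL_QUERY_RESULT, &elapsedNs);
    std::printf("GPU pass: %.3f ms\n", elapsedNs / 1.0e6);
}
```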

As to the techniques you mention, the first two probably won’t help you much; they’re largely there to get rid of CPU overhead and avoid GPU pipeline bubbles. The third is TBD, depending on whether your algorithm and your card are faster with texture arrays than with a bunch of individual textures.
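If you do experiment with texture arrays, the setup is roughly the following. It’s only a sketch: sizes, format, and layer count are placeholders, and glTexStorage3D needs GL 4.2 / ARB_texture_storage (on older contexts use glTexImage3D instead):

```cpp
// Rough texture-array setup; sizes, format, and layer count are placeholders.
#include <GL/glew.h>

GLuint CreateTextureArray(GLsizei width, GLsizei height, GLsizei layers) {
    GLuint tex = 0;
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D_ARRAY, tex);
    glTexStorage3D(GL_TEXTURE_2D_ARRAY, 1, GL_RGBA8, width, height, layers);
    glTexParameteri(GL_TEXTURE_2D_ARRAY, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
    glTexParameteri(GL_TEXTURE_2D_ARRAY, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
    return tex;
}

// Upload one layer of tightly packed RGBA8 pixels.
void UploadLayer(GLuint tex, GLsizei width, GLsizei height,
                 GLint layer, const void* pixels) {
    glBindTexture(GL_TEXTURE_2D_ARRAY, tex);
    glTexSubImage3D(GL_TEXTURE_2D_ARRAY, 0,
                    0, 0, layer, width, height, 1,
                    GL_RGBA, GL_UNSIGNED_BYTE, pixels);
}
```

On the shader side that becomes a single sampler2DArray, sampled with texture(tex, vec3(uv, layer)), so many materials can share one binding.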