Performance problems with GL_MAP_PERSISTENT_BIT

The "Beyond Porting" presentation stresses that GL_MAP_UNSYNCHRONIZED_BIT should not be used because it causes a sync between the client and driver threads. So I dropped that. Next, I wanted to save on the glMapBufferRange() and glUnmapBuffer() calls, so I tried GL_MAP_PERSISTENT_BIT. I used it in two cases, and came up against a problem.

In the first case, I used it for the uniform buffers that hold the transform matrices, together with GL_MAP_FLUSH_EXPLICIT_BIT and calls to glFlushMappedBufferRange(). It seemed to work without any performance issue, even though I’m uploading per object before each draw call. I assume that any performance difference is so small that other factors dominate.

In the second case, I tried it in the following context: I have shared memory where another process draws an HD-resolution 32bpp image, on average once per frame. Whenever there’s a new image, I upload it to a PBO and from there to a texture (the PBO-to-texture transfer is asynchronous). There are actually two PBOs that I ping-pong between. When I changed from map-memcpy-unmap to a persistent mapping followed by memcpy-flush, as I had done with the uniform buffers in my first test case, the performance dropped a lot.

This happened with every combination of other flags I tried. I tried flushing right after the memcpy, and alternatively right before the data is loaded from the PBO into the texture. I tried no explicit flushing. I tried putting GL_MAP_UNSYNCHRONIZED_BIT in again. I tried GL_MAP_COHERENT_BIT. I also tried fences (one per PBO), set after the buffer is used to load the texture, with a corresponding glClientWaitSync() before the memcpy into it. I tried orphaning with GL_MAP_INVALIDATE_BUFFER_BIT (though I’m not sure it makes sense for the large amount of data being transferred). I tried these in various combinations, but in the end I simply could not get the performance back to what it was with map-memcpy-unmap.
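To make the fence ordering described above concrete, here is a rough sketch in C. It is not the poster's actual code and makes no GL calls: the two "mapped" pointers are plain memory standing in for persistently mapped PBOs, the `in_flight` flag stands in for a GLsync fence that has not signalled yet, and the names `PingPong`, `pingpong_upload` and `pingpong_fence_signalled` are hypothetical. The comments mark where the real GL calls would presumably go.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define PBO_COUNT 2  /* two PBOs, used alternately */

typedef struct {
    unsigned char *mapped[PBO_COUNT]; /* stand-ins for persistently mapped PBO pointers */
    bool in_flight[PBO_COUNT];        /* stand-in for an unsignalled GLsync fence per PBO */
    size_t size;                      /* bytes per image */
    unsigned write_index;             /* PBO to write into next */
} PingPong;

/* Copies one image into the current PBO and returns the index of the PBO
   that the texture upload should read from. */
static unsigned pingpong_upload(PingPong *p, const unsigned char *src) {
    unsigned i = p->write_index;
    /* Real code: glClientWaitSync() on the fence for PBO i, so the earlier
       texture upload from it has finished. This sketch cannot block, so it
       only asserts that the fence has already signalled. */
    assert(!p->in_flight[i]);
    memcpy(p->mapped[i], src, p->size);
    /* Real code: glFlushMappedBufferRange() if GL_MAP_FLUSH_EXPLICIT_BIT is
       used, then the texture upload from PBO i, then glFenceSync(). */
    p->in_flight[i] = true;
    p->write_index = (i + 1) % PBO_COUNT;
    return i;
}

/* Marks the stand-in fence for PBO i as signalled. */
static void pingpong_fence_signalled(PingPong *p, unsigned i) {
    p->in_flight[i] = false;
}
```

The point of the sketch is only the order of operations: wait for the fence of the PBO about to be overwritten, copy, flush, issue the upload, then place a new fence.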

What am I missing? I’m running this on an NVIDIA GTX680 with the 332.21 driver (Windows 7 x64).

Good question! You got me. I haven’t tried that trick Cass and John talk about in that presentation. Thanks for posting a link BTW! I hadn’t seen that one yet (fresh off the presses a few weeks ago).

Currently I’m an UNSYNC+INVALIDATE_RANGE / INVALIDATE_BUFFER addict. But I need to try what they suggest to see if it’s truly worth kicking UNSYNC to the curb.

You might cook up a short GLUT test program that can be easily flipped back and forth between the two methods. You’re guaranteed to get a number of folks trying it, tweaking it, and posting their results to the forum for you to see.

Try GL_MAP_PERSISTENT_BIT with immutable storage rather than normal buffers/PBOs. It should work.

I am using immutable storage: I’m creating the buffers with glNamedBufferStorageEXT().

You may be interested in these results for a series of line draws.

I do the following with GL_MAP_UNSYNCHRONIZED_BIT:

for 200 times
  map vertex buffer
  copy in 200,000 vertices
  unmap buffer
  draw GL_LINES

swap render buffer

glFenceSync
glWaitSync

I repeated this with the method suggested by Cass, using a triple-size buffer, and it ran 20-30% faster.
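For readers unfamiliar with the triple-size buffer approach, the bookkeeping can be sketched roughly as below. This is only an illustration of the offset arithmetic, assuming one persistently mapped buffer split into three equal regions that are written in rotation. The names `Ring` and `ring_next_offset` are hypothetical, and the GL calls are indicated in comments rather than made.

```c
#include <assert.h>
#include <stddef.h>

#define REGION_COUNT 3  /* "triple size": three regions in one mapped buffer */

typedef struct {
    size_t region_size;  /* bytes written per iteration (hypothetical field) */
    unsigned next;       /* region to write on the next iteration */
} Ring;

/* Returns the byte offset to write at this iteration, then advances
   to the next region, wrapping around after the third. */
static size_t ring_next_offset(Ring *r) {
    assert(r->region_size > 0);
    size_t offset = (size_t)r->next * r->region_size;
    /* Real code would first wait on the fence stored for this region
       (glClientWaitSync), write at mapped_ptr + offset, issue the draw with
       a matching vertex offset, and then store a new fence (glFenceSync). */
    r->next = (r->next + 1) % REGION_COUNT;
    return offset;
}
```

With three regions, the CPU normally writes into a region the GPU finished with some time ago, so the wait on its fence should rarely block.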

The numbers bounced around a lot more than with GL_MAP_UNSYNCHRONIZED_BIT but were never slower.

My timers are a bit crude but the speed improvement was noticeable.

My fastest test was with glBegin/glEnd, which was about 40% faster, but I was running in debug mode!

EDIT:
I tried these tests in release mode, and GL_MAP_PERSISTENT_BIT and glBegin/glEnd are on par, both about 40% faster than GL_MAP_UNSYNCHRONIZED_BIT.

That’s impressive. Thanks for posting your test results.