Are two OpenGL contexts still necessary for concurrent copy and render?

Looking at http://on-demand.gputechconf.com/gtc/2012/presentations/S0356-Optimized-Texture-Transfers.pdf
Are the two contexts required? Will rendering not occur while the DMA transfer is in progress unless I do the upload in another thread, even on last-generation NVIDIA cards? If so, how does that make sense? It seems an artificial limitation, since the hardware obviously can handle it (even on single-copy-engine consumer-level cards) if you have another thread.

(If it matters, I’m using persistently mapped PBOs.)
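
For reference, the kind of setup I mean looks roughly like this (just a sketch, assuming a GL 4.4 / ARB_buffer_storage context; size, pixels, w and h are placeholders):

    // Create and persistently map a pixel unpack buffer (PBO).
    GLuint pbo;
    glGenBuffers(1, &pbo);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
    GLbitfield flags = GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT;
    glBufferStorage(GL_PIXEL_UNPACK_BUFFER, size, NULL, flags);
    void *ptr = glMapBufferRange(GL_PIXEL_UNPACK_BUFFER, 0, size, flags);

    // Per upload: write pixels through the persistent mapping, then source the
    // texture update from the bound PBO (offset 0). A fence is still needed
    // before reusing the same region of ptr.
    memcpy(ptr, pixels, size);
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, w, h, GL_RGBA, GL_UNSIGNED_BYTE, (const void *)0);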

The OpenGL® API is inherently single-threaded. To issue OpenGL® commands, you need a context bound, and the commands will operate on the currently bound context.
A single context can only be bound to exactly one thread at any time, and one thread can only have exactly one context bound.

Except, of course, if you have the GLX_MESA_multithread_makecurrent extension or something similar.

Concurrent loading from a second thread while the first keeps on rendering requires that the second thread has its own context and that the two contexts share resources.
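
For example, with GLFW that amounts to creating a second, hidden window whose context shares objects with the main one and making it current on the loader thread (just a sketch, error handling omitted; mainWin is your existing render window):

    // Main thread: create a hidden window; its context shares objects with mainWin.
    glfwWindowHint(GLFW_VISIBLE, GLFW_FALSE);
    GLFWwindow *loaderWin = glfwCreateWindow(1, 1, "loader", NULL, mainWin);

    // Loader thread: bind the second context and issue the uploads there.
    glfwMakeContextCurrent(loaderWin);
    // ... glBufferSubData / glTexSubImage2D here; signal completion with a
    // glFenceSync that the render thread waits on before using the data ...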

Perhaps my question was not clear, or you did not look at the link. I didn’t ask about multithreading but about asynchronous operation–these are different concepts, and the one-thread-per-GL-context rule is a red herring. A DMA transfer doesn’t block a CPU thread, because the CPU is only involved in triggering it, so nothing explains why it should need to be initiated from a different thread.

The article notes that the last few generations of NVIDIA cards have copy engines, which allow DMA transfers (such as from mapped buffers) to proceed concurrently with the GPU rendering. However, it implies that the transfer is only asynchronous (as in, concurrent with rendering, on the GPU side–nothing to do with client threads) if initiated in another GL context. My question is why that is, and if it’s still the case on latest generation hardware.

To my knowledge it is still the case. Let’s try to answer why it is so…

  1. The NV Dual Copy Engine is not free. Everything in the world has its price. There is an overhead in initialization and synchronization when the Dual Copy Engine is activated. For small transfers, better results are achieved when it is off. So, by default, it is off.

  2. Since the Dual Copy Engine is off by default, there has to be a way to activate it when needed.

  3. There is no special command for turning it on. The drivers use heuristics to figure out when to do that. The trigger is a transfer issued in a separate context (a dedicated context used just for transferring data); see the sketch after this list.

  4. According to an answer on the GeForce forum (there is no official statement in any documentation), and to device queries through the CUDA API, Kepler GeForce cards also have a single copy engine. So the architecture has not changed.
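
If that is right, the pattern the driver is looking for is roughly the following (a sketch only, assuming a dedicated transfer context that shares objects with the render context; tex, pbo, w and h are placeholders):

    // Transfer thread (dedicated context): source the update from a PBO and
    // fence it so the render thread knows when the data is ready.
    glBindTexture(GL_TEXTURE_2D, tex);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, w, h, GL_RGBA, GL_UNSIGNED_BYTE, (const void *)0);
    GLsync done = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
    glFlush();   // make sure the commands (and the fence) are actually submitted

    // Render thread: have the GPU wait on the fence before sampling the texture.
    glWaitSync(done, 0, GL_TIMEOUT_IGNORED);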

Everything I wrote is based on what I have read; I’m not a driver developer. It would be nice if someone could confirm or correct my post (but that is not likely, since no NV driver developer has posted anything on this forum for a couple of years).

P.S. Personally, I would like to know what happens when the copy engine accesses a texture that is currently being used for drawing. It is probably not a common case, but it should be allowed (if it isn’t already) when the two engines access different locations. That’s something that worked on SGI graphics workstations 16 years ago, I believe.

I thought consumer cards have only a single copy engine, not dual–or at least only one active at a time–going by http://www.nvidia.com/docs/IO/40049/Dual_copy_engines.pdf: “Having two separate threads running on a Quadro graphics card with the consumer NVIDIA Fermi architecture or running on older generations of graphics cards the data transfers will be serialized resulting in a drop in performance.” First, I can’t tell from this whether they mean serialization happens only when there are two transfers, or whether transfer/render is serialized as well. Note that their example is overlapping upload/render/download.
Now, I’m not interested in 3-way overlapping, just 2-way (upload/render). So I’m not asking about the “Dual Copy Engine” being activated; I’m asking about a single copy engine being activated. Is a second GL context/client thread still necessary? I can’t tell from the information presented so far.

I’m sorry if I was not clear. Quadro cards have two DMA channels, while GeForce cards and low-end Quadros have just one, but the principle is the same. Pre-Fermi cards have none. So, on pre-Fermi cards everything is serialized; on GeForce, a transfer and rendering can overlap (2-way overlapping); while on Quadros, two transfers (upload and download) and rendering can overlap (3-way overlapping).

3-way overlapping is not possible on GeForce cards anyway. As for the second context, it is as necessary today as it was when the copy engines were introduced. How else can you tell the driver to activate a separate DMA channel on the graphics card?

I don’t think a second context is a good choice, because all synchronization is left to OpenGL/the driver, which may not work well in some cases.
I prefer to “waste” a bit of time and write small procedures/functions that use async transfers (double-buffering the object, or orphaning).
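
For example, orphaning looks roughly like this (just a sketch; vbo, BUF_SIZE and data are placeholders):

    // Orphaning: request fresh storage so the driver does not have to wait for
    // the GPU to finish with the old contents, then refill it.
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, BUF_SIZE, NULL, GL_STREAM_DRAW);   // orphan
    glBufferSubData(GL_ARRAY_BUFFER, 0, BUF_SIZE, data);             // refill

With double buffering you instead write into buffer[frame % 2] while the GPU is still reading from buffer[(frame + 1) % 2].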

However, some OpenGL “objects” are not shareable between contexts, so that could be a significant limitation.

Do you mean with GL_MAP_UNSYNCHRONIZED_BIT? That is asynchronous between the driver and the GPU, but it causes the client thread and the driver’s server thread to synchronize: see the “Beyond Porting” slides, page 9: “It’s quite expensive (almost always needs to be avoided)”.
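
That is, something like this (just a sketch; pbo, offset, len and src are placeholders), where you also have to do your own fencing so you don’t overwrite data the GPU is still reading:

    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
    // No implicit wait on the GPU, but on some drivers this can still synchronize
    // the application thread with the driver's internal server thread.
    void *ptr = glMapBufferRange(GL_PIXEL_UNPACK_BUFFER, offset, len,
                                 GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT);
    memcpy(ptr, src, len);
    glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);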

That’s on NVIDIA drivers specifically; others don’t necessarily have that issue.

Does this upload-overlapping copy engine trigger apply only to glTexSubImage(), or also to (persistently) mapped buffer objects? Since I use indirect rendering, I keep possibly large buffer objects holding all transforms, materials, and other per-draw data. I’d like to know whether the copy-engine/DMA transfer concurrent with rendering applies only to texture uploads and not to buffer object uploads, whether the same dual context/thread trick would trigger concurrent uploading for buffer objects as well, or whether buffer objects already DMA-transfer concurrently even in the same context/thread as rendering (until glMemoryBarrier(GL_BUFFER_UPDATE_BARRIER_BIT))?
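
For context, my setup is roughly the following (a sketch; drawDataBuf, DRAW_DATA_SIZE and drawCount are placeholders, and the VAO/element/indirect buffer setup is omitted):

    // Persistently (and coherently) mapped buffer holding per-draw transforms,
    // materials, etc., read by the shaders during indirect draws.
    GLbitfield flags = GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT;
    glBindBuffer(GL_SHADER_STORAGE_BUFFER, drawDataBuf);
    glBufferStorage(GL_SHADER_STORAGE_BUFFER, DRAW_DATA_SIZE, NULL, flags);
    void *drawData = glMapBufferRange(GL_SHADER_STORAGE_BUFFER, 0, DRAW_DATA_SIZE, flags);

    // Per frame: write this frame's per-draw data into drawData (with my own
    // fencing so a region still in use by the GPU is not overwritten), then:
    glMultiDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_INT, (const void *)0, drawCount, 0);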

Does the silence mean no one knows? :frowning:

One of the reasons I hate GL_ARB_buffer_storage is that you can’t specify the memory location except through the stupid hint.
If you could, it would be easy to use one system-memory buffer, access it with simple reads/writes, and then use glCopyBufferSubData (ARB_copy_buffer) to move the data over to a VRAM buffer. That would pretty much ensure that the DMA copy engine is used and no CPU/GPU cycles are wasted.
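
That is, roughly this pattern (just a sketch; staging, vramBuf and SIZE are placeholders, and whether the “staging” buffer really ends up in system memory and the destination in VRAM is entirely up to the driver, which is exactly my complaint):

    // Staging buffer: hint that it should live in client-accessible memory.
    glBindBuffer(GL_COPY_READ_BUFFER, staging);
    glBufferStorage(GL_COPY_READ_BUFFER, SIZE, NULL,
                    GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_CLIENT_STORAGE_BIT);

    // Destination buffer: no mapping flags, hopefully placed in video memory.
    glBindBuffer(GL_COPY_WRITE_BUFFER, vramBuf);
    glBufferStorage(GL_COPY_WRITE_BUFFER, SIZE, NULL, 0);

    // (map the staging buffer and write the data into it here, then...)
    // Server-side copy: the part a copy engine could do while rendering continues.
    glCopyBufferSubData(GL_COPY_READ_BUFFER, GL_COPY_WRITE_BUFFER, 0, 0, SIZE);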

Why don’t you try it and share your findings with us? Capture a few seconds of what happens on your system while the transfer is active and analyze it with GPUView.
What I can confirm is that Fermi really has three hardware queues, and some actions are done in parallel (in the cases I saw, the other queues were used by the Desktop Window Manager).