Performance issue with glDrawArraysInstanced

Hello everyone,
I’m trying to implement an OpenGL 4 instanced drawing technique where each instance is a single triangle.
The main reasons I want this kind of setup are:
[ul]
[li]the ability to use less GPU memory in the frequent scenario where colors are given on a per-triangle basis rather than on a per-vertex basis[/li][li]the ability to perform per-triangle computations without using geometry shaders, which, from my experiments, dramatically slow down the whole pipeline (see the note right after this list)[/li][/ul]
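
(About the second point: since the vertex shader receives all three corner positions of the current instance, a flat per-triangle quantity such as the face normal could be computed right there, e.g. normalize(cross(tri_p1 - tri_p0, tri_p2 - tri_p0)), with no geometry shader stage.)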

My rendering program consists of a vertex shader and a fragment shader. The vertex shader is as follows:


#version 400 core

layout (location = 0) in vec3 tri_p0;
layout (location = 1) in vec3 tri_p1;
layout (location = 2) in vec3 tri_p2;
layout (location = 3) in vec4 tri_colorP0;
layout (location = 4) in vec4 tri_colorP1;
layout (location = 5) in vec4 tri_colorP2;

out FRAGMENT {
	vec4 color;
} vs_out;

uniform mat4 mvp_matrix;

void main(void) {
	vec3 position;
	vec4 color;
	
	// gl_VertexID is 0, 1 or 2 within each instance: pick the corresponding corner
	if(gl_VertexID == 0) {
		position = tri_p0;
		color = tri_colorP0;
	}
	else if(gl_VertexID == 1) {
		position = tri_p1;
		color = tri_colorP1;
	}
	else if(gl_VertexID == 2) {
		position = tri_p2;
		color = tri_colorP2;
	}
	
	vs_out.color = color;
	
	gl_Position = mvp_matrix * vec4(position, 1.0);
}

The fragment shader is the following:


#version 400 core

layout (location = 0) out vec4 color;

in FRAGMENT {
	vec4 color;
} fs_in;

void main(void) {
	color = fs_in.color;
}

As you can see, in my vertex shader I declare three vertex attributes for the positions and three for the colors. All these attributes are instanced, with their divisor set to 1.

The reason I have three color attributes is that sometimes I want different colors for the three triangle vertices, while more often I have a single color for the whole triangle. In the latter scenario, I simply attach the three color attributes to the same VBO with the same stride and offset.
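
For reference, the per-vertex-color variant would reuse the same attributes but point them at different offsets into a buffer that holds three RGBA colors per triangle. Roughly something like this (just a sketch: the 12-float stride and the per-corner offsets assume tightly packed corner colors):


glBindBuffer(GL_ARRAY_BUFFER, colorVbo);

//one RGBA color per corner: 12 floats per triangle, a different offset per corner
glVertexAttribPointer(TRI_COLOR_P0, 4, GL_FLOAT, GL_FALSE, 12 * sizeof(GLfloat), NULL);
glVertexAttribPointer(TRI_COLOR_P1, 4, GL_FLOAT, GL_FALSE, 12 * sizeof(GLfloat), (void *)(4 * sizeof(GLfloat)));
glVertexAttribPointer(TRI_COLOR_P2, 4, GL_FLOAT, GL_FALSE, 12 * sizeof(GLfloat), (void *)(8 * sizeof(GLfloat)));

//divisor and enable calls stay the same as in the per-triangle case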

I wrote a test application that draws a matrix of quads, each composed of two triangles.
This is the code I used to initialize vertex data:


//lay the quads out on a square grid
int numQuadsPerRowCol = (int)sqrtl(NUM_TRIANGLES / 2);
numTris = numQuadsPerRowCol * numQuadsPerRowCol * 2;

float stepX = (maxX - minX) / numQuadsPerRowCol;
float stepY = (maxY - minY) / numQuadsPerRowCol;

GLfloat* positions = new GLfloat[3 * 3 * numTris]; //3 vertices * 3 floats per triangle
GLfloat* colors = new GLfloat[4 * numTris]; //1 RGBA color per triangle

int k = 0;
int l = 0;

for (int i = 0; i < numQuadsPerRowCol; i++) {
	for (int j = 0; j < numQuadsPerRowCol; j++) {
		GLfloat color[4];

		//encode the quad index into the RGB channels
		int id = i * numQuadsPerRowCol + j;

		color[0] = ((id & 0x00ff0000) >> 16) / 255.0;
		color[1] = ((id & 0x0000ff00) >> 8) / 255.0;
		color[2] = (id & 0x000000ff) / 255.0;
		color[3] = 1.0;

		//both triangles of the quad get the same color
		for (int t = 0; t < 2; t++) {
			for (int c = 0; c < 4; c++) {
				colors[l + c] = color[c];
			}
			l += 4;
		}

		GLfloat xLeft = minX + j * stepX;
		GLfloat xRight = minX + (j + 1) * stepX;
		GLfloat yBottom = minY + i * stepY;
		GLfloat yTop = minY + (i + 1) * stepY;

		//first triangle positions
		positions[k++] = xLeft;
		positions[k++] = yTop;
		positions[k++] = 0;

		positions[k++] = xLeft;
		positions[k++] = yBottom;
		positions[k++] = 0;

		positions[k++] = xRight;
		positions[k++] = yBottom;
		positions[k++] = 0;

		//second triangle positions
		positions[k++] = xLeft;
		positions[k++] = yTop;
		positions[k++] = 0;

		positions[k++] = xRight;
		positions[k++] = yBottom;
		positions[k++] = 0;

		positions[k++] = xRight;
		positions[k++] = yTop;
		positions[k++] = 0;
	}
}

glGenBuffers(1, &positionVbo);
glBindBuffer(GL_ARRAY_BUFFER, positionVbo);
glBufferData(GL_ARRAY_BUFFER, numTris * 3 * 3 * sizeof(float), positions, GL_STATIC_DRAW);

//Each position attribute points at a different corner of the 9-float per-triangle record
glVertexAttribPointer(TRI_P0, 3, GL_FLOAT, GL_FALSE, 9 * sizeof(GLfloat), NULL);
glVertexAttribDivisor(TRI_P0, 1);
glEnableVertexAttribArray(TRI_P0);

glVertexAttribPointer(TRI_P1, 3, GL_FLOAT, GL_FALSE, 9 * sizeof(GLfloat), (void *)(3 * sizeof(GLfloat)));
glVertexAttribDivisor(TRI_P1, 1);
glEnableVertexAttribArray(TRI_P1);

glVertexAttribPointer(TRI_P2, 3, GL_FLOAT, GL_FALSE, 9 * sizeof(GLfloat), (void *)(6 * sizeof(GLfloat)));
glVertexAttribDivisor(TRI_P2, 1);
glEnableVertexAttribArray(TRI_P2);

glGenBuffers(1, &colorVbo);
glBindBuffer(GL_ARRAY_BUFFER, colorVbo);
glBufferData(GL_ARRAY_BUFFER, numTris * 4 * sizeof(float), colors, GL_STATIC_DRAW);

//All color attributes are attached to the same VBO with the same stride and offset --> per-triangle colors
glVertexAttribPointer(TRI_COLOR_P0, 4, GL_FLOAT, GL_FALSE, 0, NULL);
glVertexAttribDivisor(TRI_COLOR_P0, 1);
glEnableVertexAttribArray(TRI_COLOR_P0);

glVertexAttribPointer(TRI_COLOR_P1, 4, GL_FLOAT, GL_FALSE, 0, NULL);
glVertexAttribDivisor(TRI_COLOR_P1, 1);
glEnableVertexAttribArray(TRI_COLOR_P1);

glVertexAttribPointer(TRI_COLOR_P2, 4, GL_FLOAT, GL_FALSE, 0, NULL);
glVertexAttribDivisor(TRI_COLOR_P2, 1);
glEnableVertexAttribArray(TRI_COLOR_P2);

glBindBuffer(GL_ARRAY_BUFFER, 0);

As you can see, I use a single VBO for the positions, but each position attribute is attached to it with a different offset (0, 3 and 6 floats) and a stride of 9 floats.

For the colors, I use a single VBO and all color attributes are attached with the same stride and offset (thus achieving per-triangle colors instead of per-vertex colors).

The rendering loop is as follows:


glUseProgram(render_program);

glUniformMatrix4fv(uniforms.mvp_matrix, 1, GL_FALSE, proj_matrix * view_matrix);

//3 vertices per instance, one instance per triangle
glDrawArraysInstanced(GL_TRIANGLES, 0, 3, numTris);

I tested the application on an integrated Intel HD 4400 card and on an Nvidia GeForce GT 750M card.
Surprisingly, performance is much better on the Intel card than on the Nvidia one. Here are some FPS numbers:

800000 triangles:
Intel: 140 fps
Nvidia: 31 fps

1600000 triangles:
Intel: 74 fps
Nvidia: 16 fps

To better understand the issue, I profiled the application under Windows using GPUView, and I noticed quite different behavior between the Intel and Nvidia drivers.

Intel generates a single big DMA packet (8 kB) per frame that gets executed quite quickly. Nvidia, instead, generates a far larger number of small packets (4-8 bytes each) every frame; they queue up and therefore have to wait a long time before being executed.

This information made me wonder whether this might be an Nvidia driver bug.

Does anybody have any advice on how to improve performance on the Nvidia card?