Preparing data for instanced rendering

I’m making a game for PC that involves a large terrain with forests, so there are many trees, rocks etc. to be rendered. Instanced rendering suits low-poly meshes that
appear in large quantities, but perhaps not high-poly objects.

I’m planning on making a renderer that supports both instanced and traditional rendering, but I’m not sure how to store/prepare the object data.

I’ve implemented instancing before by using glUniform* to load an array of transformation matrices. What I’m trying to figure out is whether I should
(1) first pack all object transformations into a single contiguous array and then call glUniform* once per instance group, or
(2) call glUniform* for each object transformation matrix separately before the instanced draw call.

If I always had a contiguous array of transformations, the obvious choice would be (1), but in practice the transformations are not contiguous, because not all objects are visible
at the same time. So making a contiguous array for use in (1) would cost some CPU time each frame (and also some additional memory). Copying 1000 matrices
into a single contiguous array every frame seems like a lot of work for the CPU, but is it still better than option (2)?
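For scale, a 4-by-4 float matrix is 64 bytes, so gathering 1000 of them copies about 64 KB per frame. A minimal sketch of option (1) follows, assuming a hypothetical `Object` type and a `visible` index list produced by culling; neither name comes from my actual renderer. The upload call is shown only as a comment and is an assumption about how the gathered array would be used.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// 4x4 matrix, 16 floats = 64 bytes.
struct mat4 { float data[16]; };

// Hypothetical object type: the matrix sits next to unrelated fields,
// so the matrices of different objects are not contiguous in memory.
struct Object {
    int  other[10];
    mat4 trans;
};

// Option (1): gather the matrices of the visible objects into one
// contiguous array. 'visible' holds object indices from culling.
// 'out' is reused between frames, so it stops allocating once it has grown.
void gatherTransforms(const std::vector<Object>& objects,
                      const std::vector<std::size_t>& visible,
                      std::vector<mat4>& out)
{
    out.clear();
    out.reserve(visible.size());
    for (std::size_t idx : visible) {
        assert(idx < objects.size());
        out.push_back(objects[idx].trans);
    }
    // Assumed upload step, one call per instance group, e.g.:
    // glUniformMatrix4fv(location, (GLsizei)out.size(), GL_FALSE, out[0].data);
}
```

Keeping `out` alive across frames means the per-frame cost is the copy itself rather than copy plus allocation.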

I could also arrange my objects into batches according to, e.g., quadtree leaves, and then maintain a contiguous array of transformations in each batch.
The problem is that the granularity of those batches may not be optimal for all objects. This is also a big design decision.
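A rough sketch of what such a batch could look like, assuming one batch per mesh per quadtree leaf. `InstanceBatch` and its members are hypothetical names used for illustration only. Removal swaps the last instance into the freed slot so that the array stays contiguous without shifting elements.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct mat4 { float data[16]; };

// Hypothetical per-leaf batch: the transforms of all instances of one
// mesh inside one quadtree leaf, stored contiguously.
struct InstanceBatch {
    std::vector<mat4> transforms;  // contiguous, one entry per instance
    std::vector<int>  objectIds;   // objectIds[i] owns transforms[i]

    // Append an instance and return the slot it was stored in.
    std::size_t add(int objectId, const mat4& m) {
        transforms.push_back(m);
        objectIds.push_back(objectId);
        return transforms.size() - 1;
    }

    // Remove a slot in O(1) by moving the last instance into it.
    // Returns the id of the object that moved into 'slot',
    // or -1 if the removed slot was the last one.
    int remove(std::size_t slot) {
        assert(slot < transforms.size());
        const std::size_t last = transforms.size() - 1;
        int moved = -1;
        if (slot != last) {
            transforms[slot] = transforms[last];
            objectIds[slot]  = objectIds[last];
            moved = objectIds[slot];
        }
        transforms.pop_back();
        objectIds.pop_back();
        return moved;
    }
};
```

With this layout a visible leaf can hand `transforms` to the upload step directly, at the cost of updating the slot stored for whichever object `remove` reports as moved.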

I made a small test program to estimate the time of copying 4-by-4 matrices to a buffer:

#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <iostream>
#include <vector>

using namespace std;

// Monotonic clock reading in whole milliseconds.
long int getTimeMilliSec()
{
    timespec spec;

    clock_gettime(CLOCK_MONOTONIC, &spec);

    return ((long int)spec.tv_sec)*1000 + ((long int)spec.tv_nsec)/1000000;
}

// 4x4 matrix
struct mat4
{
    float data[16];
};

// Renderable object with dummy data and transformation matrix.
struct CObject
{
    int data[10];
    mat4 trans;
    float data2[20];
};

int main(int argc, char **argv)
{
    if(argc < 3){
        cerr << "usage: " << argv[0] << " nobjects ninstances" << endl;
        return 1;
    }

    // Number of objects in the world.
    const int nobjects = atoi(argv[1]);

    // Number of visible objects in a frame.
    const int ninstances = atoi(argv[2]);

    // A buffer to hold transformations. Heap-allocated, because a
    // variable-length stack array is a compiler extension in C++.
    std::vector<mat4> buffer(ninstances);

    // Create objects.
    CObject *objects = new CObject[nobjects];

    // Indices to objects.
    std::vector<int> indices;

    // Make random indices. In reality these would be given by
    // a visibility culling algorithm.
    srand(time(NULL));
    for(int i=0; i<ninstances; i++)
        indices.push_back(rand()%nobjects);

    long int time = getTimeMilliSec();

    // Copy transformations into a buffer. The source is the object's
    // matrix (trans), not its dummy int array.
    for(int i=0; i<ninstances; i++)
        memcpy(buffer[i].data, objects[indices[i]].trans.data, sizeof(mat4));

    cout << getTimeMilliSec()-time << endl;

    delete [] objects;
}

Compile with

g++ test.cpp -lrt

For 50k objects and 4k instances

./a.out 50000 4000

the result is 2 ms. For 500k objects and 10k instances

./a.out 500000 10000

the result is 6 ms. The timing is probably very inaccurate, as there is some
fluctuation between different runs (the randomness may also affect cache hits).

I replaced the last bit of code with

// Copy transformations into a buffer, repeated to average the timing.
const int nrepeats = 10000;

for(int j=0; j<nrepeats; j++){
    for(int i=0; i<ninstances; i++)
        memcpy(buffer[i].data, objects[indices[i]].trans.data, sizeof(mat4));
}

cout << float(getTimeMilliSec()-time)/nrepeats << endl;

to average the timing over 10k repetitions. Now 500k objects and 10k instances
take 0.2 ms on average. I know that timing code execution is not trivial, and both of my
timing approaches might be invalid. 6 ms would be unacceptable, whereas 0.2 ms would
be totally fine.
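One possible reason for the gap is that the single run pays for touching the data the first time, while the repeated loop copies the same matrices again and again and so probably finds them in cache. The whole-millisecond timer is also coarse for intervals this short. A sketch of a finer-grained measurement using std::chrono::steady_clock is below; `timeGatherMicros` is a hypothetical helper name, and I have not verified that it explains my numbers.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstring>
#include <vector>

struct mat4 { float data[16]; };
struct CObject { int data[10]; mat4 trans; float data2[20]; };

// Copies the transforms of the indexed objects into 'buffer' and returns
// the elapsed time of that single pass in microseconds.
double timeGatherMicros(const std::vector<CObject>& objects,
                        const std::vector<int>& indices,
                        std::vector<mat4>& buffer)
{
    assert(buffer.size() >= indices.size());
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < indices.size(); ++i)
        std::memcpy(buffer[i].data, objects[indices[i]].trans.data, sizeof(mat4));
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(stop - start).count();
}
```

Calling it once right after the objects are created and then a few more times with the same indices would report the cold and warm costs separately, instead of blending them into one average.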