Name INTEL_performance_query Name Strings GL_INTEL_performance_query Contact Tomasz Madajczak, Intel (tomasz.madajczak 'at' intel.com) Contributors Piotr Uminski, Intel Slawomir Grajewski, Intel Status Complete, shipping on selected Intel graphics. Version Last Modified Date: December 20, 2013 Revision: 3 Number OpenGL Extension #443 OpenGL ES Extension #164 Dependencies OpenGL dependencies: OpenGL 3.0 is required. The extension is written against the OpenGL 4.4 Specification, Core Profile, October 18, 2013. OpenGL ES dependencies: This extension is written against the OpenGL ES 2.0.25 Specification and OpenGL ES 3.0.2 Specification. Overview The purpose of this extension is to expose Intel proprietary hardware performance counters to the OpenGL applications. Performance counters may count: - number of hardware events such as number of spawned vertex shaders. In this case the results represent the number of events. - duration of certain activity, like time took by all fragment shader invocations. In that case the result usually represents the number of clocks in which the particular HW unit was busy. In order to use such counter efficiently, it should be normalized to the range of <0,1> by dividing its value by the number of render clocks. - used throughput of certain memory types such as texture memory. In that case the result of performance counter usually represents the number of bytes transferred between GPU and memory. This extension specifies universal API to manage performance counters on different Intel hardware platforms. Performance counters are grouped together into proprietary, hardware-specific, fixed sets of counters that are measured together by the GPU. It is assumed that performance counters are started and ended on any arbitrary boundaries during rendering. A set of performance counters is represented by a unique query type. Each query type is identified by assigned name and ID. Multiple query types (sets of performance counters) are supported by the Intel hardware. However each Intel hardware generation supports different sets of performance counters. Therefore the query types between hardware generations can be different. The definition of query types and their results structures can be learned through the API. It is also documented in a separate document of Intel OGL Performance Counters Specification issued per each new hardware generation. The API allows to create multiple instances of any query type and to sample different fragments of 3D rendering with such instances. Query instances are identified with handles. New Procedures and Functions void GetFirstPerfQueryIdINTEL(uint *queryId); void GetNextPerfQueryIdINTEL(uint queryId, uint *nextQueryId); void GetPerfQueryIdByNameINTEL(char *queryName, uint *queryId); void GetPerfQueryInfoINTEL(uint queryId, uint queryNameLength, char *queryName, uint *dataSize, uint *noCounters, uint *noInstances, uint *capsMask); void GetPerfCounterInfoINTEL(uint queryId, uint counterId, uint counterNameLength, char *counterName, uint counterDescLength, char *counterDesc, uint *counterOffset, uint *counterDataSize, uint *counterTypeEnum, uint *counterDataTypeEnum, uint64 *rawCounterMaxValue); void CreatePerfQueryINTEL(uint queryId, uint *queryHandle); void DeletePerfQueryINTEL(uint queryHandle); void BeginPerfQueryINTEL(uint queryHandle); void EndPerfQueryINTEL(uint queryHandle); void GetPerfQueryDataINTEL(uint queryHandle, uint flags, sizei dataSize, void *data, uint *bytesWritten); New Tokens Returned by the capsMask parameter of GetPerfQueryInfoINTEL PERFQUERY_SINGLE_CONTEXT_INTEL 0x0000 PERFQUERY_GLOBAL_CONTEXT_INTEL 0x0001 Accepted by the flags parameter of GetPerfQueryDataINTEL PERFQUERY_WAIT_INTEL 0x83FB PERFQUERY_FLUSH_INTEL 0x83FA PERFQUERY_DONOT_FLUSH_INTEL 0x83F9 Returned by GetPerfCounterInfoINTEL function as counter type enumeration in location pointed by counterTypeEnum PERFQUERY_COUNTER_EVENT_INTEL 0x94F0 PERFQUERY_COUNTER_DURATION_NORM_INTEL 0x94F1 PERFQUERY_COUNTER_DURATION_RAW_INTEL 0x94F2 PERFQUERY_COUNTER_THROUGHPUT_INTEL 0x94F3 PERFQUERY_COUNTER_RAW_INTEL 0x94F4 PERFQUERY_COUNTER_TIMESTAMP_INTEL 0x94F5 Returned by glGetPerfCounterInfoINTEL function as counter data type enumeration in location pointed by counterDataTypeEnum PERFQUERY_COUNTER_DATA_UINT32_INTEL 0x94F8 PERFQUERY_COUNTER_DATA_UINT64_INTEL 0x94F9 PERFQUERY_COUNTER_DATA_FLOAT_INTEL 0x94FA PERFQUERY_COUNTER_DATA_DOUBLE_INTEL 0x94FB PERFQUERY_COUNTER_DATA_BOOL32_INTEL 0x94FC Accepted by the parameter of GetIntegerv: PERFQUERY_QUERY_NAME_LENGTH_MAX_INTEL 0x94FD PERFQUERY_COUNTER_NAME_LENGTH_MAX_INTEL 0x94FE PERFQUERY_COUNTER_DESC_LENGTH_MAX_INTEL 0x94FF Accepted by the parameter of GetBooleanv: PERFQUERY_GPA_EXTENDED_COUNTERS_INTEL 0x9500 Add new Section 4.4 to Chapter 4, Event Model for OpenGL 4.4 Add new Section 2.18 to Chapter 2, OpenGL ES Operation for OpenGL ES 3.0.2 4.4 Performance Queries (for OpenGL 4.4) 2.18 Performance Queries (for OpenGL ES 3.0.2) Hardware and software performance counters can be used to obtain information about GPU activity. Performance counters are grouped into query types. Different query types can be supported on different hardware platforms and/or driver versions. One or more instances of the query types can be created. Each query type has unique query ID. Query ids supported on given platform can be queried in the run-time. Function: void GetFirstPerfQueryIdINTEL(uint *queryId); returns the identifier of the first performance query type that is supported on a given platform. The result is passed in location pointed by queryId parameter. If the given hardware platform doesn't support any performance queries, then the value of 0 is returned and INVALID_OPERATION error is raised. If queryId pointer is equal to 0, INVALID_VALUE error is generated. Next query ids can be queried by multiply call to the function: void GetNextPerfQueryIdINTEL(uint queryId, uint *nextQueryId); This function returns the integer identifier of the next performance query on a given platform to the specified with queryId. The result is passed in location pointed by nextQueryId. If query identified by queryId is the last query available the value of 0 is returned. If the specified performance query identifier is invalid then INVALID_VALUE error is generated. If nextQueryId pointer is equal to 0, an INVALID_VALUE error is generated. Whenever error is generated, the value of 0 is returned. Each performance query type has a name and a unique identifier. The query identifier for a given query name be read using function: void GetPerfQueryIdByNameINTEL(char *queryName, uint *queryId); This function returns the identified of the query type specified by the string provided as queryName parameter. If queryName does not reference a valid query name, an INVALID_VALUE error is generated. General description of a query type can be read using the function: void GetPerfQueryInfoINTEL(uint queryId, uint queryNameLength, char *queryName, uint *dataSize, uint *noCounters, uint *maxInstances, uint *noActiveInstances, uint *capsMask); The function returns information about the performance query specified with queryId parameter, particularly: - query name in queryName location. The maximal name is specified by queryNameLength - size of query output structure in bytes in dataSize location - number of performance counters in the query output structure in noCounters location - the maximal allowed number of query instances that can be created on a given architecture in maxInstances location. Because the other type queries are created using the same resources, it may happen that the actual amount of created instances is smaller than the returned number - the actual number of already created query instances in maxInstances location - mask of query capabilities in capsMask location. If the mask returned in capsMask contains PERFQUERY_SINGLE_CONTEXT_INTEL token this means the query supports context sensitive measurements, otherwise, if the mask contains token of GL_PERFQUERY_GLOBAL_CONTEXT_INTEL this means the query doesn't support that feature and the counters will be updated for all render contexts as they are global for hardware. If queryId does not reference a valid query type, an INVALID_VALUE error is generated. Performance counters that belong to the same query type have unique ids. Performance counter ids values start with 1. Performance counter id 0 is reserved as an invalid counter. Information about performance counters that belongs to a given query type can be read using the function: void GetPerfCounterInfoINTEL(uint queryId, uint counterId, uint counterNameLength, char *counterName, uint counterDescLength, char *counterDesc, uint *counterOffset, uint *counterDataSize, uint *counterTypeEnum, uint *counterDataTypeEnum, uint64 *rawCounterMaxValue); The function returns descriptive information about each particular performance counter that is an element of the performance query. The counter is identified with a pair of queryId and counterId parameters. The following parameters are returned: - counter name in counterName location. The maximal length of copied name is specified with counterNameLength. - counter description text in counterDesc location. The maximal length of copied text is specified with counterDescLength. - byte offset of the counter from the start of the query structure in counterOffset location. - counter size in bytes in counterDataSize location. - counter type enumeration in counterTypeEnum location. It can be one o the following tokens: PERFQUERY_COUNTER_EVENT_INTEL PERFQUERY_COUNTER_DURATION_NORM_INTEL PERFQUERY_COUNTER_DURATION_RAW_INTEL PERFQUERY_COUNTER_THROUGHPUT_INTEL PERFQUERY_COUNTER_RAW_INTEL PERFQUERY_COUNTER_TIMESTAMP_INTEL - counter data type enumeration, in counterDataTypeEnum location. It can be one o the following tokens: PERFQUERY_COUNTER_DATA_UINT32_INTEL PERFQUERY_COUNTER_DATA_UINT64_INTEL PERFQUERY_COUNTER_DATA_FLOAT_INTEL PERFQUERY_COUNTER_DATA_DOUBLE_INTEL PERFQUERY_COUNTER_DATA_BOOL32_INTEL - for some raw counters for which the maximal value is deterministic, the maximal value of the counter in 1 second is returned in the location pointed by rawCounterMaxValue, otherwise, the location is written with the value of 0. If the pair of queryId and counterId does not reference a valid counter, an INVALID_VALUE error is generated. A single instance of the performance query of a given type can be created using function: void CreatePerfQueryINTEL(uint queryId, uint *queryHandle); The handle to newly created query instance is returned in queryHandle location. If queryId does not reference a valid query type, an INVALID_VALUE error is generated. If the query instance cannot be created due to exceeding the number of allowed instances or driver fails query creation due to an insufficient memory reason, an OUT_OF_MEMORY error is generated, and the location pointed by queryHandle returns NULL. Existing query instance can be deleted using function void DeletePerfQueryINTEL(uint queryHandle); queryHandle must be a query instance handle returned by CreatePerfQueryINTEL(). If a query handle doesn't reference a previously created performance query instance, an INVALID_VALUE error is generated. A new measurement session for a given query instance can be started using function: void BeginPerfQueryINTEL(uint queryHandle); where queryHandle must be a query instance handle returned by CreatePerfQueryINTEL(). If a query handle doesn't reference a previously created performance query instance, an INVALID_VALUE error is generated. Note that some query types, they cannot be collected in the same time. Therefore calls of BeginPerfQueryINTEL() cannot be nested if they refer to queries of such different types. In such case INVALID_OPERATION error is generated. The counters may not start immediately after BeginPerfQueryINTEL(). Because the API and GPU are asynchronous, the start of performance counters is delayed until the graphics hardware actually executes the hardware commands issued by this function. However, it is guaranteed that collecting of performance counters will start before any draw calls specified in the same context after call to BeginPerfQueryINTEL(). Collecting performance counters may be stopped by a function: void EndPerfQueryINTEL(uint queryHandle); where queryHandle must be a query instance handle returned by CreatePerfQueryINTEL(). The function ends the measurement session started by BeginPerfQueryINTEL(). If a performance query is not currently started, an INVALID_OPERATION error will be generated. Similarly as in glBeginPerfQueryINTEL() case, the execution of glEndPerfQueryINTEL() is not immediate. The end of measurement is delayed until graphics hardware completes processing of the hardware commands issued by this function. However, it is guaranteed that results any draw calls specified in the same context after call to EndPerfQueryINTEL() will be not measured by this query. The query result can be read using function: void GetPerfQueryDataINTEL(uint queryHandle, uint flags, sizei dataSize, void *data, uint *bytesWritten); The function returns the values of counters which have been measured within the query session identified by queryHandle. The call may end without returning any data if they are not ready for reading as the measurement session is still pending (the EndPerfQueryINTEL() command processing is not finished by hardware). In this case location pointed by the bytesWritten parameter will be set to 0. The meaning of the flags parameter is the following: - PERFQUERY_DONOT_FLUSH_INTEL means that the call of GetPerfQueryDataINTEL() is non-blocking, which checks for results and returns them if they are available. Otherwise, (if the results of the query are not ready) it returns without flushing any outstanding 3D commands to the GPU. The use case for this is when a flush of outstanding 3D commands to GPU has already been ensured with other OpenGL API calls. - PERFQUERY_FLUSH_INTEL means that the call of GetPerfQueryDataINTEL() is non-blocking, which checks for results and returns them if they are available. Otherwise, it implicitly submits any outstanding 3D commands to the GPU for execution. In that case the subsequent call of glGetPerfQueryDataINTEL() may return data once the query completes. - PERFQUERY_WAIT_INTEL means that the call of GetPerfQueryDataINTEL() is blocking and waits till the query results are available and returns them. It means that if the query results are not yet available then it implicitly submits any outstanding 3D commands to GPU and waits for the query completion. If the measurement session indentified by queryHandle is completed then the call of GetPerfQueryDataINTEL() always writes query result to the location pointed by the data parameter and the amount of bytes written is stored in the location pointed by the bytesWritten parameter. If bytesWritten or data pointers are NULL then an INVALID_VALUE error is generated. New Implementation Dependent State Add new Table 23.75 to Chapter 23, State Tables (OpenGL 4.4) Add new Table 6.37 to Chapter 6.2, State Tables (OpenGL ES 3.0.2) Get Value Type Get Command Value Description ------------------------------ ---- ----------- ----- ------------- PERFQUERY_QUERY_NAME_LENGTH_MAX_INTEL Z+ GetIntegerv 256 max query name length PERFQUERY_COUNTER_NAME_LENGTH_MAX_INTEL Z+ GetIntegerv 256 max counter name length PERFQUERY_COUNTER_DESC_LENGTH_MAX_INTEL Z+ GetIntegerv 1024 max description length PERFQUERY_GPA_EXTENDED_COUNTERS_INTEL B GetBooleanv - extended counters available Issues 1. What is the usage model of this extension? Generally there are two approaches of measuring performance with Intel OGL Performance Queries, such as: - Per draw call measurements - performance counters can be used to assess the business of particular 3D hardware units under assumption that 3D hardware is almost 100% time busy from the CPU point of view. - Per 3D scene measurements - performance counters can be used to assess the balance of CPU and GPU processing times. Such assessment shows whether the workload is CPU whether GPU bound. 2. How per draw call measurements are performed? In the per-draw call usage model each call to the draw routine (e.g. glDrawArrays, glDrawElements) should be surrounded by a dedicated query instance. That means that each draw operation should be measured independently. It is recommended to measure the GPU performance characteristics for a single draw call to find possible bottlenecks for the application executed on a given hardware. 3. How per scene measurements are performed? The usage model assumes that one performance query instance measures a complete scene. It is recommended to figure out if the workload is CPU or GPU bound. It should be noted that: - For a longer scope of performance query the probability of 3D hardware frequency change is higher. The higher probability of frequency change causes that the larger percentage of results may be biased with gross errors. - For complicated 3D scenes the condition of render commands split is always met. Thus, to calculate an average 3D hardware unit utilization for a longer period of time it is recommended to use a larger number of per draw call queries rather than a lower number of per 3D scene queries. It is recommended to use this method when application uses full screen mode as current implementation of queries supports only global context. 4. How results of the query can be read? Results of the queries cannot be read before the entire drawing is done by the GPU. This means that the application programmer has to decide about the synchronization method it uses to read the query results. There are the following options: - Use glFlush to trigger submission of any pending commands to the GPU. Later check results availability with repetitive non-blocking calls to GetPerfQueryDataINTEL function using the synchronization flag of GL_PERFQUERY_DONOT_FLUSH_INTEL. - Use flag GL_PERFQUERY_FLUSH_INTEL in glGetPerfQueryDataINTEL to trigger submission of any pending commands to the GPU. If results are not immediately available, check their availability with repetitive non-blocking calls to GetPerfQueryDataINTEL function using the synchronization flag of GL_PERFQUERY_DONOT_FLUSH_INTEL. - Do a blocking call to glGetPerfQueryDataINTEL() with GL_PERFQUERY_WAIT_INTEL flag set. The flag ensures that any pending GPU commands are submitted and function blocks till GPU results are available. It is allowed to perform simultaneous measurements with multiple active queries of the same type. However it may be not allowed to perform simultaneous measurements of queries with different types, as it may require reprogramming of the same hardware part and could destroy the hardware settings of the previous query. 5. Are query results always accurate? There are certain hardware conditions which may cause the results of performance counters expressed in hardware clocks to be inaccurate. The conditions may include: - Render clock change - the condition usually causes that all counter values expressed in hardware clocks are incorrect. It is indicated by FrequencyChanged flag. - Render commands split - in some cases GPU has to split execution of drawing operations surrounded by the query into at least two parts. The condition usually causes that counter values expressed in time domains (in microseconds) may be substantially larger than the average values of that counter. It is indicated by SplitOccured flag. - Rendering preemption - if GPU is shared among two or more 3D applications, the hardware counters gathered in a global mode contain additive results for these applications. The condition is also indicated with SplitOccured flag. The above conditions are indicated in special fields in the query results structures. It is up to the user to decide if the results are to be processed further or dropped. In certain cases it can be determined that the render commands split condition always occurs and has to be accepted. 6. Are query results per-context or global? Some GPU platforms and/or driver versions support only global GPU counters. In such cases, the query instance has to have GL_PERFQUERY_GLOBAL_CONTEXT_INTEL flag set when creating query instance. Otherwise, creation will fail and an INVALID_OPERATION error will be generated. Support for a global context means that a single query instance measures all GPU activities performed between query start and query end. Query measures not only current OpenGL context but also activities of other OpenGL contexts, other 3D API like DX and operating system windows draw calls. Program examples 1. Reading counter meta data example // query data has proprietary predefined structure layout // associated with the vendor query ID GL_QUERY_PIPELINE_METRICS * pQueryData; uint queryId; uint nextQueryId; uint queryHandle; uint dataSize; uint noCounters; uint noInstances; uint capsMask; const uint queryNameLen = 32; char queryName[queryNameLen]; const uint counterNameLen = 32; char counterName[counterNameLen]; const uint counterDescLen = 256; char counterDesc[counterDescLen]; //get first vendor queryID glGetFirstPerfQueryIdINTEL(&queryId); nextQueryId = queryId; while(nextQueryId) { glGetPerfQueryInfoINTEL( nextQueryId, queryNameLen, &queryName, &dataSize, &noCounters, &noInstances, &capsMask); for(int counterId = 1; counterId <= noCounters; counterId++) { uint counterOffset; uint counterDataSize; uint counterTypeEnum; uint counterDataTypeEnum; UINT64 rawCounterMaxValue; glGetPerfCounterInfoINTEL( nextQueryId, counterId, counterNameLen, counterName, counterDescLen, counterDesc, &counterOffset, &counterDataSize, &counterTypeEnum, &counterDataTypeEnum, &rawCounterMaxValue); // use returned values here ... } } 2. Measuring a single draw call example Note that GL_QUERY_PIPELINE_METRICS is a proprietary structure defined by vendor and is used as example and function named according to the convention of glFuntionINTEL are wrappers to dynamically linked-by-name procedures. // query data has proprietary predefined structure layout // associated with the vendor query ID GL_QUERY_PIPELINE_METRICS * pQueryData; uint queryId; uint queryHandle; char queryName[] = "Intel_Pipeline_Query"; // get vendor queryID by name glGetPerfQueryIdByNameINTEL(queryName, &queryId); // create query instance of queryId type glCreatePerfQueryINTEL(queryId, &queryHandle); glBeginPerfQueryINTEL(queryHandle); // Start query glDrawElements(...); // Issue graphics commands, do whatever glEndPerfQueryINTEL(queryHandle); // End query // perform other application activities uint bytesWritten = 0; uint dataSize = sizeof(GL_QUERY_PIPELINE_METRICS); pQueryData = (GL_QUERY_PIPELINE_METRICS *) malloc(dataSize); // for the first time use GL_PERFQUERY_FLUSH_INTEL flag to ensure graphics // commands were submitted to hardware glGetPerfQueryDataINTEL( queryHandle, GL_PERFQUERY_FLUSH_INTEL, dataSize, pQueryData, &bytesWritten); while(bytesWritten == 0) { // Now enough to use GL_PERFQUERY_DONOT_FLUSH_INTEL flag glGetPerfQueryDataINTEL( queryHandle, GL__PERFQUERY_DONOT_FLUSH_INTEL, dataSize, pQueryData, &bytesWritten); } if(bytesWritten == dataSize) { // Use counters' data here uint64 vertexShaderKernelsRunCount = pQueryData->VertexShaderInvocations; uint64 fragmentShaderKernelsRunCount = pQueryData->FragmentShaderInvocations; ... } else { // error handling case } glDeletePerfQueryINTEL(queryHandle); // query instance is released 3. Measuring multiple draw calls with synchronous wait for result Note that GL_QUERY_HD_HW_METRICS is a proprietary structure defined by vendor and is used as example and function named according to the convention of glFuntionINTEL are wrappers to dynamically linked-by-name procedures. // query data has proprietary predefined structure layout // associated with the vendor query ID GL_QUERY_HD_HW_METRICS * pQueryData; uint queryId; UINT32 queryHandle[1000]; char queryName[] = "Intel_HD_Hardware_Counters"; // get vendor queryID by name glGetPerfQueryIdByNameINTEL(queryName, &queryId); // create memory for 1000 results uint dataSize = sizeof(GL_QUERY_HD_HW_METRICS); pQueryData = (GL_QUERY_HD_HW_METRICS *) malloc(dataSize * 1000); // create 1000 query instances of queryId type for(int i = 0; i < 1000; i++) { glCreatePerfQueryINTEL(queryId, &queryHandle[i]); } uint currentDrawNumber = 0; // start 1st query glBeginPerfQueryINTEL(queryHandle[currentDrawNumber]); glDrawElements(...); // Issue graphics commands // end query glEndPerfQueryINTEL(queryHandle[currentDrawNumber++]); ... // start nth query glBeginPerfQueryINTEL(queryHandle[currentDrawNumber]); glDrawElements(...); // Issue graphics commands // end query glEndPerfQueryINTEL(queryHandle[currentDrawNumber++]); ... // assume currentDrawNumber == 1000 here // so get all results after these 1000 draws GL_QUERY_HD_HW_METRICS *pData = pQueryData; for(int i = 0; i < 1000; i++) { uint bytesWritten = 0; // use GL_PERFQUERY_WAIT_INTEL flag to cause the function will wait // for the query completion glGetPerfQueryDataINTEL( queryHandle[i], GL_PERFQUERY_WAIT_INTEL, dataSize, pData, &bytesWritten); if(bytesWritten != sizeof(GL_QUERY_HD_HW_METRICS)) { // query error case assert(false); ... // some cleanup needed also ... return ERROR; } pData++; } // use counters data ... // repeat measurements if needed reusing the query instances ... // query instances are no longer needed so release all of them for(int i = 0; i < 1000; i++) { glDeletePerfQueryINTEL(queryHandle[i]); } return SUCCESS; Revision History 1.3 20/12/13 Jon Leech Assign extension #s and enum values. Fix a few typos (Bug 11345). 1.2 29/11/13 sgrajewski Extension upgraded to 4.4 core specification. ES3.0.2 dependencies added. 1.1 06/06/11 puminski Initial revision.