Name

    NV_shader_thread_shuffle

Name Strings

    GL_NV_shader_thread_shuffle

Contributors

    Jeannot Breton, NVIDIA
    Pat Brown, NVIDIA
    Eric Werness, NVIDIA
    Mark Kilgard, NVIDIA

Contact

    Jeannot Breton, NVIDIA Corporation (jbreton 'at' nvidia.com)

Status

    Shipping.

Version

    Last Modified Date:         2/14/2014
    NVIDIA Revision:            3

Number

    OpenGL Extension #448

Dependencies

    This extension is written against the OpenGL 4.3 (Compatibility Profile)
    Specification.

    This extension is written against version 4.30 (revision 07) of the
    OpenGL Shading Language Specification.

    OpenGL 4.3 and GLSL 4.3 are required.

    This extension interacts with NV_gpu_program5.

Overview

    Implementations of the OpenGL Shading Language may, but are not required
    to, run multiple shader threads for a single stage as a SIMD thread
    group, where individual execution threads are assigned to thread groups
    in an undefined, implementation-dependent order. This extension provides
    a set of new features to the OpenGL Shading Language to share data
    between multiple threads within a thread group.

    Shaders using the new functionality provided by this extension should
    enable it via the construct

      #extension GL_NV_shader_thread_shuffle : require     (or enable)

    This extension also specifies some modifications to the program assembly
    language to support the thread data sharing functionality.

New Procedures and Functions

    None

New Tokens

    None

Modifications to The OpenGL Shading Language Specification, Version 4.30
(Revision 07)

    Including the following line in a shader can be used to control the
    language features described in this extension:

      #extension GL_NV_shader_thread_shuffle : <behavior>

    where <behavior> is as specified in section 3.3.

    New preprocessor #defines are added to the OpenGL Shading Language:

      #define GL_NV_shader_thread_shuffle         1

    Modify Section 8.3, Common Functions, p.
133 (add a function to share data between threads in a thread group)

    Syntax:

      int   shuffleDownNV(int data, uint index, uint width, [out bool threadIdValid])
      ivec2 shuffleDownNV(ivec2 data, uint index, uint width, [out bool threadIdValid])
      ivec3 shuffleDownNV(ivec3 data, uint index, uint width, [out bool threadIdValid])
      ivec4 shuffleDownNV(ivec4 data, uint index, uint width, [out bool threadIdValid])

      uint  shuffleDownNV(uint data, uint index, uint width, [out bool threadIdValid])
      uvec2 shuffleDownNV(uvec2 data, uint index, uint width, [out bool threadIdValid])
      uvec3 shuffleDownNV(uvec3 data, uint index, uint width, [out bool threadIdValid])
      uvec4 shuffleDownNV(uvec4 data, uint index, uint width, [out bool threadIdValid])

      float shuffleDownNV(float data, uint index, uint width, [out bool threadIdValid])
      vec2  shuffleDownNV(vec2 data, uint index, uint width, [out bool threadIdValid])
      vec3  shuffleDownNV(vec3 data, uint index, uint width, [out bool threadIdValid])
      vec4  shuffleDownNV(vec4 data, uint index, uint width, [out bool threadIdValid])

      bool  shuffleDownNV(bool data, uint index, uint width, [out bool threadIdValid])
      bvec2 shuffleDownNV(bvec2 data, uint index, uint width, [out bool threadIdValid])
      bvec3 shuffleDownNV(bvec3 data, uint index, uint width, [out bool threadIdValid])
      bvec4 shuffleDownNV(bvec4 data, uint index, uint width, [out bool threadIdValid])

      int   shuffleUpNV(int data, uint index, uint width, [out bool threadIdValid])
      ivec2 shuffleUpNV(ivec2 data, uint index, uint width, [out bool threadIdValid])
      ivec3 shuffleUpNV(ivec3 data, uint index, uint width, [out bool threadIdValid])
      ivec4 shuffleUpNV(ivec4 data, uint index, uint width, [out bool threadIdValid])

      uint  shuffleUpNV(uint data, uint index, uint width, [out bool threadIdValid])
      uvec2 shuffleUpNV(uvec2 data, uint index, uint width, [out bool threadIdValid])
      uvec3 shuffleUpNV(uvec3 data, uint index, uint width, [out bool threadIdValid])
      uvec4 shuffleUpNV(uvec4 data, uint index, uint width, [out bool threadIdValid])

      float shuffleUpNV(float data, uint index, uint width, [out bool threadIdValid])
      vec2  shuffleUpNV(vec2 data, uint index, uint width, [out bool threadIdValid])
      vec3  shuffleUpNV(vec3 data, uint index, uint width, [out bool threadIdValid])
      vec4  shuffleUpNV(vec4 data, uint index, uint width, [out bool threadIdValid])

      bool  shuffleUpNV(bool data, uint index, uint width, [out bool threadIdValid])
      bvec2 shuffleUpNV(bvec2 data, uint index, uint width, [out bool threadIdValid])
      bvec3 shuffleUpNV(bvec3 data, uint index, uint width, [out bool threadIdValid])
      bvec4 shuffleUpNV(bvec4 data, uint index, uint width, [out bool threadIdValid])

      int   shuffleXorNV(int data, uint index, uint width, [out bool threadIdValid])
      ivec2 shuffleXorNV(ivec2 data, uint index, uint width, [out bool threadIdValid])
      ivec3 shuffleXorNV(ivec3 data, uint index, uint width, [out bool threadIdValid])
      ivec4 shuffleXorNV(ivec4 data, uint index, uint width, [out bool threadIdValid])

      uint  shuffleXorNV(uint data, uint index, uint width, [out bool threadIdValid])
      uvec2 shuffleXorNV(uvec2 data, uint index, uint width, [out bool threadIdValid])
      uvec3 shuffleXorNV(uvec3 data, uint index, uint width, [out bool threadIdValid])
      uvec4 shuffleXorNV(uvec4 data, uint index, uint width, [out bool threadIdValid])

      float shuffleXorNV(float data, uint index, uint width, [out bool threadIdValid])
      vec2  shuffleXorNV(vec2 data, uint index, uint width, [out bool threadIdValid])
      vec3  shuffleXorNV(vec3 data, uint index, uint width, [out bool threadIdValid])
      vec4  shuffleXorNV(vec4 data, uint index, uint width, [out bool threadIdValid])

      bool  shuffleXorNV(bool data, uint index, uint width, [out bool threadIdValid])
      bvec2 shuffleXorNV(bvec2 data, uint index, uint width, [out bool threadIdValid])
      bvec3 shuffleXorNV(bvec3 data, uint index, uint width, [out bool threadIdValid])
      bvec4 shuffleXorNV(bvec4 data, uint index, uint width, [out bool threadIdValid])

      int   shuffleNV(int data, uint index, uint width, [out bool threadIdValid])
      ivec2 shuffleNV(ivec2 data, uint index, uint width, [out bool threadIdValid])
      ivec3 shuffleNV(ivec3 data, uint index, uint width, [out bool threadIdValid])
      ivec4 shuffleNV(ivec4 data, uint index, uint width, [out bool threadIdValid])

      uint  shuffleNV(uint data, uint index, uint width, [out bool threadIdValid])
      uvec2 shuffleNV(uvec2 data, uint index, uint width, [out bool threadIdValid])
      uvec3 shuffleNV(uvec3 data, uint index, uint width, [out bool threadIdValid])
      uvec4 shuffleNV(uvec4 data, uint index, uint width, [out bool threadIdValid])

      float shuffleNV(float data, uint index, uint width, [out bool threadIdValid])
      vec2  shuffleNV(vec2 data, uint index, uint width, [out bool threadIdValid])
      vec3  shuffleNV(vec3 data, uint index, uint width, [out bool threadIdValid])
      vec4  shuffleNV(vec4 data, uint index, uint width, [out bool threadIdValid])

      bool  shuffleNV(bool data, uint index, uint width, [out bool threadIdValid])
      bvec2 shuffleNV(bvec2 data, uint index, uint width, [out bool threadIdValid])
      bvec3 shuffleNV(bvec3 data, uint index, uint width, [out bool threadIdValid])
      bvec4 shuffleNV(bvec4 data, uint index, uint width, [out bool threadIdValid])

    Shuffle functions allow active threads within a thread group to exchange
    data using 4 different modes (up, down, xor, indexed). They all load the
    <data> operand, which can be different per thread, and return a value
    read from the source thread at an address computed from the <index> and
    <width> operands.

    <index> is a 5-bit value in the range 0 to 31; the most significant bits
    are ignored.

    <threadIdValid> is an optional operand. It holds the value of the
    predicate that specifies whether the source thread from which the
    current thread reads data is in range.

    <width> is used for segmenting the thread group into multiple segments.
    The segments need to be of equal size, so <width> needs to be a power of
    2 in the range 2 to 32. Using a <width> of 32 would treat the thread
    group as a single segment. A <width> of 8 would divide the thread group
    into 4 segments of size 8.
    Using a <width> that is not a power of 2, or that is lower than 2 or
    larger than 32, will return an undefined value. Threads can only share
    data within their own segment.

    Each thread executing a built-in shuffle function will determine the ID
    of another thread by combining its value of gl_ThreadInWarpNV with its
    value of <index>, as described below. It will then attempt to read the
    value of <data> in that other thread and return that value to the
    caller.

    When a shuffle function attempts to access the value of <data> from
    another thread, it determines whether the other thread is in accessible
    range. If it is in range, true will be returned in the optional
    <threadIdValid> parameter, if provided by the caller. If it is out of
    range, false will be returned in <threadIdValid>, if provided by the
    caller, and the value returned by the function will come from the
    current thread.

    The 4 modes use the following logic to compute the source thread index
    and the <threadIdValid> value:

    shuffleNV computes the source index using <index> as an absolute address
    within the thread group segment.

      srcThreadId = <index>
      <threadIdValid> = <index> < <width>

    For example, with this thread group segment:

                        -----------------
              Thread Id |0|1|2|3|4|5|6|7|
                        -----------------
          Thread <data> |a|b|c|d|e|f|g|h|
                        -----------------

    If <index> is 2

                        -----------------
          src thread Id |2|2|2|2|2|2|2|2|
                        -----------------
        <threadIdValid> |1|1|1|1|1|1|1|1|
                        -----------------
                 result |c|c|c|c|c|c|c|c|
                        -----------------

    If <index> is 9

                        -----------------
          src thread Id |9|9|9|9|9|9|9|9|
                        -----------------
        <threadIdValid> |0|0|0|0|0|0|0|0|
                        -----------------
                 result |a|b|c|d|e|f|g|h|
                        -----------------

    shuffleUpNV subtracts <index> from the current thread id to get the
    source thread id. This has the effect of shifting the segment up by
    <index> threads. Source thread ids do not wrap around, so the lower
    thread ids are left unchanged.
      srcThreadId = currentThreadId - <index>
      <threadIdValid> = srcThreadId >= 0

    For example, with this thread group segment:

                        -----------------
              Thread Id |0|1|2|3|4|5|6|7|
                        -----------------
          Thread <data> |a|b|c|d|e|f|g|h|
                        -----------------

    If <index> is 1

                        ------------------
          src thread Id |-1|0|1|2|3|4|5|6|
                        ------------------
        <threadIdValid> |0 |1|1|1|1|1|1|1|
                        ------------------
                 result |a |a|b|c|d|e|f|g|
                        ------------------

    shuffleDownNV adds <index> to the current thread id to get the source
    thread id. This has the effect of shifting the segment down by <index>
    threads. Source thread ids do not wrap around, so the higher thread ids
    are left unchanged.

      srcThreadId = currentThreadId + <index>
      <threadIdValid> = srcThreadId < <width>

    For example, with this thread group segment:

                        -----------------
              Thread Id |0|1|2|3|4|5|6|7|
                        -----------------
          Thread <data> |a|b|c|d|e|f|g|h|
                        -----------------

    If <index> is 2

                        -----------------
          src thread Id |2|3|4|5|6|7|8|9|
                        -----------------
        <threadIdValid> |1|1|1|1|1|1|0|0|
                        -----------------
                 result |c|d|e|f|g|h|g|h|
                        -----------------

    shuffleXorNV does a bitwise xor between <index> and the current thread
    id to get the source thread id:

      srcThreadId = currentThreadId ^ <index>
      <threadIdValid> = srcThreadId < <width>

    For example, with this thread group segment:

                        -----------------
              Thread Id |0|1|2|3|4|5|6|7|
                        -----------------
          Thread <data> |a|b|c|d|e|f|g|h|
                        -----------------

    If <index> is 0x1

                        -----------------
          src thread Id |1|0|3|2|5|4|7|6|
                        -----------------
        <threadIdValid> |1|1|1|1|1|1|1|1|
                        -----------------
                 result |b|a|d|c|f|e|h|g|
                        -----------------

Dependencies on NV_gpu_program5

    If NV_gpu_program5 is supported and "OPTION NV_shader_thread_shuffle" is
    specified in an assembly program, the following edits are made to extend
    the assembly programming model documented in the NV_gpu_program4
    extension and extended by NV_gpu_program5.

    If NV_gpu_program5 is not supported, or if
    "OPTION NV_shader_thread_shuffle" is not specified in an assembly
    program, the contents of this dependencies section should be ignored.
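As a cross-check of the four GLSL shuffle modes specified in the Section 8.3 edits above, their index and predicate rules can be modelled on the CPU. The sketch below is a Python reference model, not GLSL and not part of this extension. The names `shuffle_segment` and `butterfly_sum` are hypothetical. It assumes a single segment whose thread ids run from 0 to width - 1, that every thread is active, and that all threads pass the same index.

```python
def shuffle_segment(mode, data, index, width):
    """Model one shuffle call for a single segment of `width` active threads.

    mode  -- "idx" (shuffleNV), "up", "down" or "xor"
    data  -- per-thread data values, indexed by thread id within the segment
    index -- the index operand (the same for every thread in this model)
    Returns (results, valid): per-thread return values and threadIdValid flags.
    """
    # The spec leaves the result undefined for an invalid width; this model
    # rejects such calls instead (a modelling choice, not spec behaviour).
    if not 2 <= width <= 32 or width & (width - 1):
        raise ValueError("width must be a power of 2 in the range 2 to 32")
    if len(data) != width:
        raise ValueError("expected one data value per thread in the segment")
    index &= 0x1F  # index is a 5-bit value; higher bits are ignored
    results, valid = [], []
    for tid in range(width):
        if mode == "idx":
            src = index
            ok = src < width
        elif mode == "up":
            src = tid - index
            ok = src >= 0
        elif mode == "down":
            src = tid + index
            ok = src < width
        elif mode == "xor":
            src = tid ^ index
            ok = src < width
        else:
            raise ValueError("unknown mode: " + mode)
        # Out-of-range reads fall back to the calling thread's own value.
        results.append(data[src] if ok else data[tid])
        valid.append(ok)
    return results, valid


def butterfly_sum(values):
    """Sum across a segment using xor shuffles; every thread ends with the total."""
    width = len(values)
    offset = width // 2
    while offset > 0:
        partner, _ = shuffle_segment("xor", values, offset, width)
        values = [v + p for v, p in zip(values, partner)]
        offset //= 2
    return values
```

Calling `shuffle_segment("down", list("abcdefgh"), 2, 8)` yields c d e f g h g h with the last two validity flags false, which matches the shuffleDownNV example. `butterfly_sum` shows a typical use of the xor mode: after log2(width) exchange steps, every thread in the segment holds the sum of all inputs.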
    Section 2.X.2, Program Grammar

    (add the following rules to the grammar)

      ::= "SHFDOWN" | "SHFIDX" | "SHFUP" | "SHFXOR"

    Modify Section 2.X.4, Program Execution Environment

    (Add the table entries and relevant text describing the program
    instructions to exchange data between threads.)

      Instr-     Modifiers
      uction  V  F I C S H D Out Inputs   Description
      ------- -- - - - - - - --- -------- --------------------------------
      ...
      SHFDOWN 50 X X - - - - F v v,vu,vu warp shuffle with added index
      SHFIDX  50 X X - - - - F v v,vu,vu warp shuffle with absolute index
      SHFUP   50 X X - - - - F v v,vu,vu warp shuffle with subtracted index
      SHFXOR  50 X X - - - - F v v,vu,vu warp shuffle with XORed index
      ...

    (Add to "Section 2.X.6, Program Options" of the NV_gpu_program4
    extension, as extended by NV_gpu_program5)

    + Shader thread shuffle (NV_shader_thread_shuffle)

    If a program specifies the "NV_shader_thread_shuffle" option, it may use
    the "SHFXOR", "SHFDOWN", "SHFIDX" and "SHFUP" instructions. If this
    option is not specified, a program will fail to compile if it uses those
    instructions.

    Section 2.X.8.Z, SHFDOWN: warp shuffle with added index

    The SHFDOWN instruction allows a 32-bit scalar value to be exchanged
    between multiple threads within a thread group. The instruction has 3
    operands as input.

    The first operand is a 32-bit scalar. This value will be shared between
    threads; it can be a float, a signed integer or an unsigned integer.

    The second operand is an unsigned integer index in the range 0 to 31. It
    is used to compute from which thread the current thread will read the
    32-bit scalar value. For the SHFDOWN instruction, the source thread id
    is the id of the current thread plus the index operand.

    The last operand is an unsigned integer mask. The mask is used for
    segmenting the thread group and limiting the source thread index. Bits 0
    to 4 of the mask are a clamp value that limits the source thread index,
    and bits 8 to 12 are a segmentation mask used to divide the thread group
    into multiple smaller groups.
    Together, the clamp value and the segmentation mask generate 2 internal
    values, minThreadId and maxThreadId, using the following logic:

      minThreadId = current thread id & segmentationMask
      maxThreadId = minThreadId | (clamp & ~segmentationMask)

    Those 2 values segment the thread group by restricting the address range
    a specific thread can access.

    SHFDOWN returns a 2-component vector. The first component is a predicate
    that is TRUE when the computed source thread id is in range and FALSE
    when it is out of bounds. For SHFDOWN, the source thread id is in range
    when it is lower than maxThreadId. The second component holds a 32-bit
    value. When the source thread id is in range, this value comes from the
    source thread. When the source thread id is out of range, the value is
    read from the current thread. If the source thread id refers to an
    inactive thread, the returned result will be undefined.

    SHFDOWN supports all data type modifiers. For floating-point data types,
    the TRUE value is 1.0 and the FALSE value is 0.0. For signed integer
    data types, the TRUE value is -1 and the FALSE value is 0. For unsigned
    integer data types, the TRUE value is the maximum integer value (all
    bits are ones) and the FALSE value is zero.

    Section 2.X.8.Z, SHFIDX: warp shuffle with absolute index

    The SHFIDX instruction allows a 32-bit scalar value to be exchanged
    between multiple threads within a thread group. The instruction has 3
    operands as input.

    The first operand is a 32-bit scalar. This value will be shared between
    threads; it can be a float, a signed integer or an unsigned integer.

    The second operand is an unsigned integer index in the range 0 to 31. It
    is used to compute from which thread the current thread will read the
    32-bit scalar value. For the SHFIDX instruction, the source thread id is
    computed using the following operation:

      source thread id = (index operand & ~segmentationMask) | minThreadId

    The last operand is an unsigned integer mask.
    The mask is used for segmenting the thread group and limiting the source
    thread index. Bits 0 to 4 of the mask are a clamp value that limits the
    source thread index, and bits 8 to 12 are a segmentation mask used to
    divide the thread group into multiple smaller groups. Together, the
    clamp value and the segmentation mask generate 2 internal values,
    minThreadId and maxThreadId, using the following logic:

      minThreadId = current thread id & segmentationMask
      maxThreadId = minThreadId | (clamp & ~segmentationMask)

    Those 2 values segment the thread group by restricting the address range
    a specific thread can access.

    SHFIDX returns a 2-component vector. The first component is a predicate
    that is TRUE when the computed source thread id is in range and FALSE
    when it is out of bounds. For SHFIDX, the source thread id is in range
    when it is lower than maxThreadId. The second component holds a 32-bit
    value. When the source thread id is in range, this value comes from the
    source thread. When the source thread id is out of range, the value is
    read from the current thread. If the source thread id refers to an
    inactive thread, the returned result will be undefined.

    SHFIDX supports all data type modifiers. For floating-point data types,
    the TRUE value is 1.0 and the FALSE value is 0.0. For signed integer
    data types, the TRUE value is -1 and the FALSE value is 0. For unsigned
    integer data types, the TRUE value is the maximum integer value (all
    bits are ones) and the FALSE value is zero.

    Section 2.X.8.Z, SHFUP: warp shuffle with subtracted index

    The SHFUP instruction allows a 32-bit scalar value to be exchanged
    between multiple threads within a thread group. The instruction has 3
    operands as input.

    The first operand is a 32-bit scalar. This value will be shared between
    threads; it can be a float, a signed integer or an unsigned integer.

    The second operand is an unsigned integer index in the range 0 to 31.
    It is used to compute from which thread the current thread will read the
    32-bit scalar value. For the SHFUP instruction, the source thread id is
    the id of the current thread minus the index operand.

    The last operand is an unsigned integer mask. The mask is used for
    segmenting the thread group and limiting the source thread index. Bits 0
    to 4 of the mask are a clamp value that limits the source thread index,
    and bits 8 to 12 are a segmentation mask used to divide the thread group
    into multiple smaller groups. Together, the clamp value and the
    segmentation mask generate 2 internal values, minThreadId and
    maxThreadId, using the following logic:

      minThreadId = current thread id & segmentationMask
      maxThreadId = minThreadId | (clamp & ~segmentationMask)

    Those 2 values segment the thread group by restricting the address range
    a specific thread can access.

    SHFUP returns a 2-component vector. The first component is a predicate
    that is TRUE when the computed source thread id is in range and FALSE
    when it is out of bounds. For SHFUP, the source thread id is in range
    when it is greater than maxThreadId. The second component holds a 32-bit
    value. When the source thread id is in range, this value comes from the
    source thread. When the source thread id is out of range, the value is
    read from the current thread. If the source thread id refers to an
    inactive thread, the returned result will be undefined.

    SHFUP supports all data type modifiers. For floating-point data types,
    the TRUE value is 1.0 and the FALSE value is 0.0. For signed integer
    data types, the TRUE value is -1 and the FALSE value is 0. For unsigned
    integer data types, the TRUE value is the maximum integer value (all
    bits are ones) and the FALSE value is zero.

    Section 2.X.8.Z, SHFXOR: warp shuffle with XORed index

    The SHFXOR instruction allows a 32-bit scalar value to be exchanged
    between multiple threads within a thread group. The instruction has 3
    operands as input.

    The first operand is a 32-bit scalar.
    This value will be shared between threads; it can be a float, a signed
    integer or an unsigned integer.

    The second operand is an unsigned integer index in the range 0 to 31. It
    is used to compute from which thread the current thread will read the
    32-bit scalar value. For the SHFXOR instruction, the source thread id is
    the id of the current thread XORed with the index operand.

    The last operand is an unsigned integer mask. The mask is used for
    segmenting the thread group and limiting the source thread index. Bits 0
    to 4 of the mask are a clamp value that limits the source thread index,
    and bits 8 to 12 are a segmentation mask used to divide the thread group
    into multiple smaller groups. Together, the clamp value and the
    segmentation mask generate 2 internal values, minThreadId and
    maxThreadId, using the following logic:

      minThreadId = current thread id & segmentationMask
      maxThreadId = minThreadId | (clamp & ~segmentationMask)

    Those 2 values segment the thread group by restricting the address range
    a specific thread can access.

    SHFXOR returns a 2-component vector. The first component is a predicate
    that is TRUE when the computed source thread id is in range and FALSE
    when it is out of bounds. For SHFXOR, the source thread id is in range
    when it is lower than maxThreadId. The second component holds a 32-bit
    value. When the source thread id is in range, this value comes from the
    source thread. When the source thread id is out of range, the value is
    read from the current thread. If the source thread id refers to an
    inactive thread, the returned result will be undefined.

    SHFXOR supports all data type modifiers. For floating-point data types,
    the TRUE value is 1.0 and the FALSE value is 0.0. For signed integer
    data types, the TRUE value is -1 and the FALSE value is 0. For unsigned
    integer data types, the TRUE value is the maximum integer value (all
    bits are ones) and the FALSE value is zero.

Errors

    None.

New State

    None.

New Implementation Dependent State

    None.
Issues

    None

Revision History

    Rev.  Date      Author    Changes
    ----  --------  --------  -----------------------------------------
     3    2/14/14   jbreton   Rename the extension from NVX to NV.
     2    9/4/13    jbreton   Replace mask by width in the shuffle functions.
     1    11/27/12  jbreton   Internal revisions.