The designers of certain graphics cards / GPUs have decided that Render-To-Vertex-Buffer is deprecated. To appreciate why I believe this to be a mistake, the reader first needs to know what R2VB is, or was.
The rendering pipelines of DirectX 9 and DirectX 11 differ in some respects, yet are also very similar. DirectX 9 was extremely versatile, and a wide range of applications was written for it. The fancier Dx 11 pipeline is more powerful, but has less of an established base of algorithms.
Dx 9 is approximated in OpenGL 2, while Dx 10 and Dx 11 are approximated in OpenGL 3(+).
Approximately, the rendering pipelines are like this:
- An index array is sampled, such that one element consists entirely of 1, 2, 3, 4, or 6 vertex-numbers. These represent the format with which a 3D model stores a basic geometry.
- A vertex array is sampled, according to the vertex-numbers produced by the index array. Each vertex structure contains a set of attributes, which was already somewhat flexible with Dx 9, because Dx 9 already allowed for Vertex Programs / Vertex Shaders to parse them.
- A Vertex program / Shader is run once per vertex, that inputs one set of vertex attributes, and potentially outputs a completely different set of vertex attributes. The eventual output needs to be recognized by the rasterizer however, because the outputs typically need to be interpolated, to correspond to a triangle on the screen. I.e., it is typically this output which transforms the vertex coordinates from model-space into screen-space / clip-space.
- (In the case of Dx 11 and above) Two separate shader stages can now optionally act as highly-optimized Tessellators, subdividing each input topology into a patch of output topologies. These are equivalent to what the vertex program would have output directly in Dx 9, and their positions follow entirely from the positions of the Points they take as input, without customizations in shader code. I.e., the Hull Shader subdivides, and the Domain Shader assigns new attributes, defined by a single control function.
- (In the case of Dx 10 and above) An optional Geometry Shader can now accept a set of input Points, with one out of several possible topologies, and is very flexible in the number of Points it outputs. Their attributes can be derived by complex, customized shader code. Yet, the header information for this GS states what the output topology must be. If that states ‘triangles’, the number of output Points must be a multiple of 3. If that states a (much-favored) ‘triangle strip’, then 3 or more Points must be output at a time, for each triangle strip. Again, the data of the GS stage is equivalent to the data which the VS would have output with Dx 9, especially in that finally, the Points must end up in clip-space, before being sent to the Fragment Shader…
- In all cases, a Rasterizer does its work on a Point-Sprite, or a Triangle, etc., rasterizing each such topology, interpolating the data output before, behind the scenes, and then feeding the interpolated data to the Fragment Shader, not exactly, but in essence, once per screen pixel.
- In all cases, the equivalent of a Fragment Shader processes one pixel, which is more truly one fragment, at a time, in order to arrive solely at pixel values for the screen, or for another texture rendered to.
In cases with Unified Shader Model, a single GPU core, of which a modern graphics card may have hundreds, can fulfill any one of the roles from [1 … 5] or  above, and can include the ability of texel-fetch, regardless of whether this is for a VS, a GS, or an FS. Prior to USM, certain cores could only act either to run a VS or an FS, and had non-matching abilities. Back then, those were called ‘Vertex Pipelines’ and ‘Pixel Pipelines’ and were very limited in number.
(Edit 08/14/2017 : What I mean to say, is that the index array and the vertex array are located in graphics memory, just as the 2D texture images are, but that the mentioned arrays form separate data-structures, which are officially referred to as an index-buffer and a vertex-buffer, and the GPU reads them, while the CPU loaded them into graphics memory by default.
Further, unless the geometry is quite complex, as it would be for high-end gaming, the geometry-data takes up fewer megabytes than texture images normally do, and the index-buffer usually takes up fewer megabytes, than the vertex-buffer does, both defining 3D geometry. )
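The relationship between the index-buffer and the vertex-buffer can be sketched in a few lines of Python. This is only an illustration on the CPU, not how a driver does it; the quad data and the name `fetch_triangles` are made up for the example.

```python
# Hypothetical data: a unit quad stored as 4 vertices and 2 triangles.
vertex_buffer = [
    (0.0, 0.0, 0.0),
    (1.0, 0.0, 0.0),
    (1.0, 1.0, 0.0),
    (0.0, 1.0, 0.0),
]
index_buffer = [0, 1, 2, 0, 2, 3]  # two elements of 3 vertex-numbers each


def fetch_triangles(indices, vertices):
    """Group the index buffer in threes and look each number up in the vertex buffer."""
    if len(indices) % 3 != 0:
        raise ValueError("a triangle list needs a multiple of 3 indices")
    return [
        tuple(vertices[i] for i in indices[n:n + 3])
        for n in range(0, len(indices), 3)
    ]


triangles = fetch_triangles(index_buffer, vertex_buffer)
```

Note that the 4 vertices are stored once, while the 6 indices refer to two of them twice. That sharing is why the index-buffer tends to be the smaller of the two structures.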
As the reader can see, DirectX 9 lacked any sort of Geometry Shader support per se. But what all these pipelines have in common, is the fact that they can render their output to a texture image, rather than to the actual screen, which by Dx 11 has become the default, while for Dx 9, rendering directly to the screen was the default. The format of pixels rendered to was variable and set in header information belonging to the FS. Also, the vertex array read by the VS had such a variable format, for its single vertex structures as elements.
A trick which was used in the development of DirectX 9 applications, was to select an output format for Screen Pixels, that exactly matched the input format, for a certain type of Vertex Structure.
Because both the buffer from which vertices are read in, and a buffer for pixels output, can be set by the application via the CPU, not the GPU, it was possible to tell one rendering stage that the pixels output by the FS of a previous stage, are in fact vertices. And then the VS of the later stage will receive vertices, that were generated by the GPU, via a Fragment Shader of an earlier rendering stage.
This is called Render-To-Vertex-Buffer, and it allowed DirectX 9 applications to implement a kind of poor man's Geometry Shader, by coding what is in fact a Fragment Shader, producing output bit-sets that do not correspond to sensible Screen Pixels, but which contain sensible data when parsed as Vertex Structures.
So a Dx 9 application could have a de-facto Geometry Shader, even though the specification did not specifically provide for one to be defined.
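The core of the trick, that the same bytes are pixels to one stage and vertices to the next, can be imitated on the CPU. The sketch below is not Direct3D code. It assumes a pixel format of four 32-bit floats and a vertex format of four 32-bit floats, and the functions `fragment_stage` and `read_as_vertices` are stand-ins I made up for the two rendering stages.

```python
import struct

# Assumed formats: each "pixel" is four 32-bit floats (R, G, B, A), and each
# "vertex" is four 32-bit floats (x, y, z, w). Both occupy 16 bytes.
PIXEL_FORMAT = "<4f"
VERTEX_FORMAT = "<4f"


def fragment_stage(width):
    """Stand-in for a Fragment Shader: writes one RGBA pixel per column."""
    out = bytearray()
    for x in range(width):
        out += struct.pack(PIXEL_FORMAT, float(x), float(x) * 0.5, 0.0, 1.0)
    return bytes(out)


def read_as_vertices(buffer):
    """Stand-in for the next stage's vertex fetch, reading the same bytes."""
    return list(struct.iter_unpack(VERTEX_FORMAT, buffer))


render_target = fragment_stage(4)           # "pixels" written by the first stage
vertices = read_as_vertices(render_target)  # the same bytes, parsed as vertices
```

Nothing is converted between the two stages. The only requirement is that the two formats describe the same byte layout, which is what the application arranges from the CPU side.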
In numerous postings on this blog, I have written that I have compiled various versions of the open-source graphics rendering engine named OGRE. Its shader-libraries include an example of an Iso-Surface. An Iso-Surface is a kind of output geometry, ostensibly coming from a Geometry Shader, which implements either the marching cubes approach or the marching tetrahedra, in order to fit a mesh to the density-boundary of a defined volume.
It can happen in graphics, that we begin with either a volume-texture, or with a density-function that derives from X, Y and Z coordinates as parameters, and that we need to give this data a mesh, specifically so that the mesh can possess normal-vectors, and can reflect lighting-vectors correctly. Volumes do not tend to have this feature.
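Both marching cubes and marching tetrahedra share one small step: wherever the density crosses the chosen threshold along an edge of a cell, a mesh vertex is placed by linear interpolation. The following sketch shows only that step and is not the OGRE implementation; the `density` function and the threshold of 0.0 are assumptions for the example.

```python
def edge_crossing(p0, p1, d0, d1, iso):
    """Linearly interpolate the point on edge p0-p1 where the density equals iso.

    Assumes d0 and d1 lie on opposite sides of iso, so that d0 != d1.
    """
    t = (iso - d0) / (d1 - d0)
    return tuple(a + t * (b - a) for a, b in zip(p0, p1))


def density(p):
    """Hypothetical density: falls off with squared distance from the origin."""
    x, y, z = p
    return 1.0 - (x * x + y * y + z * z)


p0, p1 = (0.0, 0.0, 0.0), (2.0, 0.0, 0.0)
vertex = edge_crossing(p0, p1, density(p0), density(p1), iso=0.0)
```

The result is only a linear estimate. With this density the true boundary lies at x = 1.0, while the interpolated vertex lands at x = 0.5, and the estimate improves as the cells become smaller.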
The Iso-Surface example and library that ship with OGRE, run under DirectX 9 and OpenGL 2. They do not require DirectX 10 or OpenGL 3 by default, and will in fact not run with those engines, because they use R2VB rather than a formal GS.
On one of my graphics cards (the GeForce GTX460), the Iso-Surface sample runs fine, while on another (an ATI / AMD), a clean error message given by OGRE states, ‘R2VB is not supported with the present rendering engine. Sorry.’ This happens, regardless of which OGRE version I have compiled, and depends purely on which computer is running the sample.
It is actually next-to-impossible, for game-devs and application-devs, to define their own OGRE-shaders from scratch, because the way shaders are loaded and supported, is integrated too closely with how OGRE itself works. We are bound to using shaders from the libraries in practice, and the libraries are extensive. But nobody ever designed an Iso-Surface implementation for OGRE, that works with DirectX 11 or GL 3.
So on a machine that supports R2VB, I am good to go, but on a machine that does not, I am borked.
It is not within my power, to convince OGRE devs, to come up with an Iso-Surface-Implementation, that is up-to-date with how a modern GS works, because they have too much inertia, and because too much work has already been invested into existing libraries.
It would cost the graphics-card manufacturer next-to-nothing, to enable R2VB. But they simply decided, it was deprecated.
Note: On a modern GPU with hundreds of cores, those are organized into groups, which typically number in the vicinity of 8. In each group, a fixed number of Render Output Generators, such as possibly 4 of them, together with the logic circuits, coordinate a number of the USM cores, in order to implement the rendering pipeline, but render output generators themselves are not thought to be highly programmable.
The reason we might need 4 of them in 1 group, is the possibility that our Render-To-Texture stages could be nested 4 deep. I.e., the output of one stage could form the input of a 2nd, the output of which could form the input of a 3rd, the output of which could form the input of a 4th stage. This needs to be done with particle-based fluids, which are rendered to a depth-buffer as Point-Sprites, but the output of which needs to be smoothed, before a deferred stage renders the fluid as 1 surface, which may reflect a background scene or refract it… The correct solution to particle-fluids, is deferred rendering…
And, because bucket-rendering is supported on modern GPUs at the hardware-level, it is common that multiple FS cores (belonging to one group) are assigned to render 1 model, where only 1 VS core needs to be assigned. This speeds up rendering to high-res screens. Contrary to what the Wiki states however, I have found in my own experience, that the screen-space is subdivided into narrow, vertical bands, X-many across, but only 1 from top-to-bottom.
If the GPU is being used for numerical computation instead of graphics output, via OpenCL or CUDA, then each available Shader Core Group is also referred to as one vector-processor, but in this capacity, the total number of them does not always equal the total number available for graphics output.
Further, the output from several groups needs to be routed or combined, before the CPU or the display can make use of it, so that additional ‘render-output-generators’ can exist on a GPU, which do not belong to a specific group.
By default, a vertex structure consists of a fixed number of numeric fields, and header information states whether these are supposed to be 32-bit floating-point numbers, or integers, and how many of them exist. Any code running as a VS determines in its usage, what these fields represent. For graphics, a standard set would include
- (Always) Vertex Position, in 3 model-space coordinates,
- (If Not a Point-Sprite) Normal-Vector,
- (Optionally) Per-Vertex parallax-mapping, that complements the Normal-Vector (Tangent-Vectors in floating-point),
- Per-Vertex U,V texture-mapping for the model geometry,
- (Optionally) Additional texture-mappings, such as for static light-maps, terrain, etc.. Some of these could also be U,V,W-mappings, into 3D-textures,
- (Optionally) Per-Vertex Color, including R, G, B, A,
- (Optionally) Bone 1 and Bone 2 (Integers),
- (…) Bone 1 and Bone 2 blend-weights (floating-point)…
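To make the idea of 'header information' concrete, here is a sketch that describes one possible vertex structure as a list of fields, and derives the byte offset of each field plus the size of one vertex. The field names, the choice of a float colour, and the helper `layout` are assumptions for illustration; real applications choose their own set.

```python
import struct

# Hypothetical declaration: (field name, number of components, type code),
# where "f" is a 32-bit float and "i" is a 32-bit integer.
DECLARATION = [
    ("position", 3, "f"),
    ("normal", 3, "f"),
    ("uv", 2, "f"),
    ("color", 4, "f"),
    ("bone_indices", 2, "i"),
    ("bone_weights", 2, "f"),
]


def layout(declaration):
    """Return (byte offset of each field, struct format, bytes per vertex)."""
    offsets = {}
    fmt = "<"  # little-endian, no padding between fields
    for name, count, code in declaration:
        offsets[name] = struct.calcsize(fmt)
        fmt += f"{count}{code}"
    return offsets, fmt, struct.calcsize(fmt)


offsets, vertex_format, stride = layout(DECLARATION)
```

With this declaration every vertex occupies 64 bytes, and the texture coordinates start 24 bytes into each one. A shader never sees these names; it only sees numbered inputs, and its code decides what they mean.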
The registers of the GPU are generally organized as 4, 32-bit fields each, and there is a set minimum number which any GPU must possess to be compliant. But certain specific registers are set aside by the rasterization scheme, also to act as interpolators for floating-point values, if a certain flag is set. These interpolation units tend to be agnostic, to whether they are mapping U,V Texture-Coordinates, or anything else, because all the interpolations take place separately for each channel passed along.
A more recent development which also exists, is double-width GPU-registers, that can work on sets of 8 such fields at once, instead of on sets of 4.
A pixel also consists of a fixed number of numerical fields, the bit-format of which is defined by header information, and which can be made to match that of some vertex-structures.
The 4×4 matrices, with which a GPU can perform vector-computations in a highly-optimized way, are represented as a series of 4 standard registers. A 3×3 matrix is usually implemented as a 4×4, with the exception that the 4th field of each register is not used, because the 4th value stored in a vector is set to (0.0), when a 4-component vector is cast down to a 3-component. When a 3-component vector is cast the other way, to a 4-component, explicit code specifies what the 4th field should be set to (usually 1.0).
When a 4×4 is identified in low-level code, this is done by way of its first register-number out of 4. Naturally, 3×3 only needs to consume 3 consecutive registers, because anything the GPU is optimized to do with it, will only output a 3-component vector formally; the definition of the unused 4th output-component, is zero. Otherwise, it would follow from the 4th row of a matrix.
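The difference between the 4×4 and the 3×3 case can be sketched as follows. I am assuming here that each register holds one row of the matrix, and that a transform is one dot-product per row; the function names are made up for the example.

```python
def to_vec4(v3, w=1.0):
    """Cast a 3-component vector up to 4 components, stating the 4th explicitly."""
    return (v3[0], v3[1], v3[2], w)


def transform4(registers, v4):
    """4x4 case: one dot-product per register (assumed to hold one row each)."""
    return tuple(sum(row[i] * v4[i] for i in range(4)) for row in registers)


def transform3(registers, v3):
    """3x3 case: only the first 3 registers and the first 3 fields are read.

    The 4th output component is defined as zero.
    """
    out = tuple(sum(registers[j][i] * v3[i] for i in range(3)) for j in range(3))
    return out + (0.0,)


# Hypothetical matrix: a translation by 5 units along X.
translate_x = (
    (1.0, 0.0, 0.0, 5.0),
    (0.0, 1.0, 0.0, 0.0),
    (0.0, 0.0, 1.0, 0.0),
    (0.0, 0.0, 0.0, 1.0),
)

moved = transform4(translate_x, to_vec4((1.0, 2.0, 3.0)))
rotated_only = transform3(translate_x, (1.0, 2.0, 3.0))
```

The 4×4 form moves the point, because the 4th field of 1.0 picks up the translation. The 3×3 form ignores the translation entirely, which is the behaviour wanted for direction vectors such as normals.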
(Edit 01/10/2017 : ) One subject which has already been acknowledged elsewhere, but which deserves mention again here, is the way in which GPU-based graphics handles normal-vectors, differently from Mathematical fact.
According to Math, each triangle is capable of having a normal-vector. But at the edges and vertices, because this normal-vector represents a derivative in 3D, there would be a step in the normal-vector, making its function non-continuous, and for certain purposes undefined.
Because on the GPU, each triangle is only defined in the index-array as a combination of 3 vertex-numbers, this approach has no place to store its true normal-vector. OTOH, each vertex can have a complex structure within the vertex-array, with capacity to store a normal-vector.
So in general what happens, is that the CPU goes through all the triangles of the mesh and computes each true normal-vector, according to Math which is established. Each normal-vector can simply be the cross-product between two edges, but normalized. Then, this normal-vector is added into an accumulator for each vertex, as well as a count, of how many triangles have so far contributed to the value per vertex.
After that pass, another pass will scalar-multiply the accumulated normal-vector stored with each vertex, by the reciprocal of the number of summations, resulting in an averaged vector.
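The two passes just described can be sketched as below. This is a plain CPU illustration with made-up names, not code from any particular model editor. The small test mesh is two triangles that meet at a right angle.

```python
import math


def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def normalize(v):
    length = math.sqrt(v[0] ** 2 + v[1] ** 2 + v[2] ** 2)
    return (v[0] / length, v[1] / length, v[2] / length)


def averaged_vertex_normals(vertices, triangles):
    sums = [[0.0, 0.0, 0.0] for _ in vertices]
    counts = [0] * len(vertices)
    # Pass 1: add each triangle's true normal into its 3 vertices.
    for i0, i1, i2 in triangles:
        edge_a = sub(vertices[i1], vertices[i0])
        edge_b = sub(vertices[i2], vertices[i0])
        n = normalize(cross(edge_a, edge_b))
        for i in (i0, i1, i2):
            for k in range(3):
                sums[i][k] += n[k]
            counts[i] += 1
    # Pass 2: scale each accumulated vector by the reciprocal of its count.
    return [
        tuple(c * (1.0 / count) for c in acc) if count else (0.0, 0.0, 0.0)
        for acc, count in zip(sums, counts)
    ]


vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
triangles = [(0, 1, 2), (0, 3, 1)]
normals = averaged_vertex_normals(vertices, triangles)
```

Vertex 0 is shared by both triangles and ends up with the normal (0.0, 0.5, 0.5), which is shorter than unit length. As far as I know, it is common to normalize again afterwards, either on the CPU or in the shader, but that step is not part of the description above.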
So the oddity which follows, is that for the fragments that reside within the triangle, the normal-vectors are interpolated, from fictitious ones that have been stored for the vertices, as part of the model definition.
This works well in the case of smoothed surfaces. But sometimes we would like for the normal-vectors not to be smoothed. In fact, we might want to software-render a model in which a subset of the triangles are not smoothed, where we have fine control over the rendering system, but we might be using a 3D model editor which displays a version of the model in real-time, which is based on GPU-rendering, and which acts as a preview for our editing operations.
And so a low-level representation of the model must be derived again, from the high-level representation, in which only a subset of triangles is smoothed, but in which all the triangles can be GPU-rendered.
I believe the way this usually works, is that some of the low-level vertices sent to the GPU are repeated, once as belonging to each adjoining face, that is not smoothed. So there will be a complete definition of each face, in which the generated vertex has a normal-vector equal to that of the face. And then, nothing forbids vertices belonging to separate faces simply occupying the same position in 3D.
So, while according to common-sense, a cube has 8 vertices, if it is to be GPU-rendered without smoothing, it will receive 24 fictitious vertices, 4 per rectangular face, and at each real vertex-position, 3 vertices belonging to the low-level representation will coincide. Each will be sent to the GPU as having a different normal-vector.
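The cube example can be checked with a short sketch. The corner ordering, the face list and the function `split_for_flat_shading` are my own choices for illustration, and I am not paying attention to winding order here.

```python
from collections import Counter

# The 8 real corners of a unit cube; corner number = 4*x + 2*y + z.
positions = [(x, y, z) for x in (0, 1) for y in (0, 1) for z in (0, 1)]

# Each face lists its 4 corner numbers, together with the face's own normal.
faces = [
    ((0, 1, 3, 2), (-1, 0, 0)),
    ((4, 5, 7, 6), (1, 0, 0)),
    ((0, 1, 5, 4), (0, -1, 0)),
    ((2, 3, 7, 6), (0, 1, 0)),
    ((0, 2, 6, 4), (0, 0, -1)),
    ((1, 3, 7, 5), (0, 0, 1)),
]


def split_for_flat_shading(positions, faces):
    """Emit one low-level vertex per (face, corner) pair, carrying the face normal."""
    low_level = []
    for corners, normal in faces:
        for index in corners:
            low_level.append((positions[index], normal))
    return low_level


low_level_vertices = split_for_flat_shading(positions, faces)
coincident = Counter(position for position, _ in low_level_vertices)
```

The output holds 24 vertices, and each of the 8 real positions appears 3 times, once with each of the 3 adjoining face normals.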
Similarly, while the GPU only understands triangles, it is common that graphics software works with quads. The use of quads is simpler to code for, as my attempts at creating tessellators in previous postings demonstrate. And so this difference is often satisfied, by the CPU drawing a diagonal line through each quad, resulting in two triangles.
This software-based tessellation is unrelated to the hardware-pipeline-stages above, which tessellate triangles that have already been sent to the GPU, and which were used at some point to define the geometry statically.
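The CPU-side splitting of quads can be sketched as follows, assuming each quad is given as 4 vertex-numbers in order around its perimeter. Which diagonal is used is an arbitrary choice in this example.

```python
def split_quad(quad):
    """Split one quad (a, b, c, d) along the diagonal a-c into two triangles."""
    a, b, c, d = quad
    return [(a, b, c), (a, c, d)]


def triangulate(quads):
    """Turn a list of quads into a flat list of triangles for the index array."""
    return [triangle for quad in quads for triangle in split_quad(quad)]


triangle_list = triangulate([(0, 1, 2, 3), (3, 2, 5, 4)])
```

Only the index array changes. The vertices themselves are left alone, so the two triangles of each quad share the two vertices on the diagonal.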
(Update 3/10/2020, 5h35 : )
There is also an informative article about CUDA Cores at the following link: