Post Transform Cache: Difference between revisions

From OpenGL Wiki
Jump to navigation Jump to search
Hoppe algorithm
 
(5 intermediate revisions by 3 users not shown)
Line 3: Line 3:
== Functionality ==
== Functionality ==


Vertex processing with [[Vertex Shader|vertex shaders]] is a very strict process. A single set of [[Vertex Attributes|vertex attributes]] enter the vertex shader, and a single set of post-transformed data comes out. The output of this stage is based ''solely'' on the inputs (since [[GLSL Uniforms|uniforms]] cannot change within a rendering call). Therefore, if you can detect that you have received the same vertex attribute inputs, you do not have to do vertex processing at all. Instead, if the outputs for that input attribute set is in the cache, the cached data can be used. This saves vertex processing.
Vertex processing with [[Vertex Shader|vertex shaders]] is a very strict process. A single set of [[Vertex Attribute]]s enter the vertex shader, and a single set of post-transformed data comes out. For any particular rendering call, the output of this stage is based ''solely'' on the inputs (since [[Uniform (GLSL)|Uniform]]s cannot change within a rendering call). Therefore, if the system can detect that you have passed the same vertex attribute inputs, then the system can avoid executing the vertex shader again. Instead, if the outputs for that input attribute set is in the cache, the cached data can be used. This saves vertex processing.


In the absolute best case, you never have to process the same vertex more than once.
In the absolute best case, you never have to process the same vertex more than once.


The key for testing whether a vertex has been processed before is the index of that vertex. Therefore, when doing [[Vertex Specification|non-indexed rendering]], you have no access to the post transform cache.
The test for whether a vertex is the same as a previous one is somewhat indirect. It would be impractical to test all of the user-defined attributes for inequality. So instead, a different means is used.


If the index for the vertex is in the post transform cache, then that vertex data is not even read from the stream again. It skips the entire read and vertex processing steps, and simply adds another copy of that vertex's post-transform data to the output stream for primitive assembly.
Two vertices are considered equal (within a single rendering command) if the vertex's index and instance count are the same ({{code|gl_VertexID}} and {{code|gl_InstanceID}} in the shader). Since vertices for non-indexed rendering are always increasing, it is not possible to use the post transform cache with non-indexed rendering.
 
If the vertex is in the post transform cache, then that vertex data is not necessarily even read from the input vertex arrays again. The process skips the read and vertex shader execution steps, and simply adds another copy of that vertex's post-transform data to the output stream.


=== Cache Size ===
=== Cache Size ===


As with any memory buffer, there is a maximum size to the cache. In the early days of post transform caches, when they used fixed-function pipelines and not generic vertex attributes, the size of the cache was measured in the number of vertices it could store. In current days, since the format of a vertex is generic and variable, the caches are more traditional memory buffers. Thus the number of vertices that can be stored in the post transform cache nowadays depends on how many outputs you write from your vertex shader.
As with any memory buffer, there is a maximum size to the post transform cache. In the early days of post transform caches, when they used fixed-function pipelines and not generic vertex attributes, the size of the cache was measured in the number of vertices it could store. In current days, since the format of a vertex is generic and variable, the caches are more traditional memory buffers. Thus the number of vertices that can be stored in the post transform cache nowadays depends on how many outputs you write from your vertex shader.


Even so, you can expect vertex shader-based hardware to allow for a fairly large number of vertices, on the order of 20+ at least.
Even so, you can expect vertex shader-based hardware to allow for a fairly large number of vertices, on the order of 20+ at least.


The size of the post transform cache can have an affect on how you optimize your triangles. If you optimize your mesh for a large number of vertices in the cache, this mesh may get poor post transform cache behavior if the cache cannot contain as many vertices. Some optimization algorithms do not care about vertex cache size at all, however.
The size of the post transform cache can have an effect on how you optimize your triangles. If you optimize your mesh for a large number of vertices in the cache, this mesh may get poor caching behavior if the cache cannot contain as many vertices. However, some mesh optimization algorithms can work without regard to the cache size.


== Using the cache ==
== Using the cache ==


As long as you do indexed [[Vertex Specification|vertex rendering]], you will have some chance of using it. However, there are strategies one can employ to maximize the number of cache hits one gets when rendering a mesh. These strategies are primarily for optimizing the vertex index list, though some of them can suggest changes to the vertex attribute data as well.
As long as you do indexed [[Vertex Specification|vertex rendering]], you will have some chance of using it. However, there are ways to optimize the order that vertices are submitted, to maximize the number of cache hits when rendering a mesh.
 
=== Forsyth ===
 
This algorithm, developed by Tom Forsyth is a more modern algorithm. Unlike NVTriStrip, it creates an ordered triangle list ({{enum|GL_TRIANGLES}}), not a strip. Thus, the index data may be larger than for a triangle strip.
 
Unlike most other algorithms, it does not care about the size of the cache. It is generally useful, able to get quite good performance in both small and large cache situations for most arbitrary meshes.


=== NVTriStrip ===
Details can be found [http://eelpi.gotdns.org/papers/fast_vert_cache_opt.html here].


This is a small library that NVIDIA developed quite a while ago. It takes a list of triangles to define the topology (just the indices of the vertex) and returns either a large triangle strip or a set of triangle strips.
=== Regular Grids ===


The library does have some problems. It cannot handle a set of triangles where more than 2 triangles share the same edge. Even though it is an NVIDIA library, it can work just fine for non-NVIDIA hardware, as the functions take a parameter specifying the size of the post transform cache (in number of vertices).
A regular grid is, topologically, a regular grid of vertices with triangles between adjacent vertices. Note that this is only topologically speaking; the actual positions of the vertices can be anywhere. So a landscape could be a regular grid.


=== Forsyth ===
Optimizing a regular grid for a vertex cache is somewhat easier than for a regular mesh. Details for an algorithm to do this are found [http://castano.ludicon.com/blog/2009/02/02/optimal-grid-rendering/ here]. This algorithm beats Forsyth, but is far less general, as it only works for regular grids.


This algorithm, developed by Tom Forsyth is a more modern algorithm. Unlike NVTriStrip, it creates an ordered triangle list, not a strip. Thus, the index data may be larger than for a triangle strip.
=== Hoppe ===
Hugues Hoppe proposed a local optimization algorithm that improves on the 'greedy triangle-strip' algorithm. An implementation of this algorithm was shipped in the Direct3D 9 Utility library (D3DX9) for mesh optimization.


Unlike most other algorithms, it does not care about the size of the cache. It is generally useful, able to get quite good performance in both small and large cache situations for most arbitrary meshes.
The original paper can be found [http://hhoppe.com/tvc.pdf here].


Details can be found [http://home.comcast.net/~tom_forsyth/papers/fast_vert_cache_opt.html here].
[https://github.com/Microsoft/DirectXMesh DirectXMesh] is an open-source implementation of the original D3DX9 algorithm, but does not interact with graphics hardware in any way so can be used for content for any graphics API.


=== Regular Grids ===
=== NVTriStrip ===


A regular grid is, topologically, a regular grid of vertices with triangles between adjacent vertices. Note that this is only topologically speaking; the actual positions of the vertices can be anywhere.
This is a small library that NVIDIA developed quite a while ago. It takes a list of triangles to define the topology (just the indices of the vertex) and returns either a large triangle strip or a set of triangle strips. Even though it is an NVIDIA library, it can work just fine for non-NVIDIA hardware, as the functions take a parameter specifying the size of the post transform cache (in number of vertices). It does not interact with OpenGL or the graphics hardware in any way.


Optimizing a regular grid for a vertex cache is somewhat easier than for a regular mesh. Details for an algorithm to do this are found [http://castano.ludicon.com/blog/2009/02/02/optimal-grid-rendering/ here].
The library does have some problems. It cannot handle a set of triangles where more than 2 triangles share the same edge.


[[Category:Hardware]]
[[Category:Hardware]]

Latest revision as of 06:30, 11 June 2021

The Post Transform Cache (sometimes called the "post-T&L cache") is a hardware feature that modern GPUs have to improve rendering performance. It is part of the rendering pipeline. It is a memory buffer containing vertex data that has passed through the vertex processing stage, but has not yet been converted into primitives.

Functionality

Vertex processing with vertex shaders is a very strict process. A single set of Vertex Attributes enter the vertex shader, and a single set of post-transformed data comes out. For any particular rendering call, the output of this stage is based solely on the inputs (since Uniforms cannot change within a rendering call). Therefore, if the system can detect that you have passed the same vertex attribute inputs, then the system can avoid executing the vertex shader again. Instead, if the outputs for that input attribute set is in the cache, the cached data can be used. This saves vertex processing.

In the absolute best case, you never have to process the same vertex more than once.

The test for whether a vertex is the same as a previous one is somewhat indirect. It would be impractical to test all of the user-defined attributes for inequality. So instead, a different means is used.

Two vertices are considered equal (within a single rendering command) if the vertex's index and instance count are the same (gl_VertexID and gl_InstanceID in the shader). Since vertices for non-indexed rendering are always increasing, it is not possible to use the post transform cache with non-indexed rendering.

If the vertex is in the post transform cache, then that vertex data is not necessarily even read from the input vertex arrays again. The process skips the read and vertex shader execution steps, and simply adds another copy of that vertex's post-transform data to the output stream.

Cache Size

As with any memory buffer, there is a maximum size to the post transform cache. In the early days of post transform caches, when they used fixed-function pipelines and not generic vertex attributes, the size of the cache was measured in the number of vertices it could store. In current days, since the format of a vertex is generic and variable, the caches are more traditional memory buffers. Thus the number of vertices that can be stored in the post transform cache nowadays depends on how many outputs you write from your vertex shader.

Even so, you can expect vertex shader-based hardware to allow for a fairly large number of vertices, on the order of 20+ at least.

The size of the post transform cache can have an effect on how you optimize your triangles. If you optimize your mesh for a large number of vertices in the cache, this mesh may get poor caching behavior if the cache cannot contain as many vertices. However, some mesh optimization algorithms can work without regard to the cache size.

Using the cache

As long as you do indexed vertex rendering, you will have some chance of using it. However, there are ways to optimize the order that vertices are submitted, to maximize the number of cache hits when rendering a mesh.

Forsyth

This algorithm, developed by Tom Forsyth is a more modern algorithm. Unlike NVTriStrip, it creates an ordered triangle list (GL_TRIANGLES), not a strip. Thus, the index data may be larger than for a triangle strip.

Unlike most other algorithms, it does not care about the size of the cache. It is generally useful, able to get quite good performance in both small and large cache situations for most arbitrary meshes.

Details can be found here.

Regular Grids

A regular grid is, topologically, a regular grid of vertices with triangles between adjacent vertices. Note that this is only topologically speaking; the actual positions of the vertices can be anywhere. So a landscape could be a regular grid.

Optimizing a regular grid for a vertex cache is somewhat easier than for a regular mesh. Details for an algorithm to do this are found here. This algorithm beats Forsyth, but is far less general, as it only works for regular grids.

Hoppe

Hugues Hoppe proposed a local optimization algorithm that improves on the 'greedy triangle-strip' algorithm. An implementation of this algorithm was shipped in the Direct3D 9 Utility library (D3DX9) for mesh optimization.

The original paper can be found here.

DirectXMesh is an open-source implementation of the original D3DX9 algorithm, but does not interact with graphics hardware in any way so can be used for content for any graphics API.

NVTriStrip

This is a small library that NVIDIA developed quite a while ago. It takes a list of triangles to define the topology (just the indices of the vertex) and returns either a large triangle strip or a set of triangle strips. Even though it is an NVIDIA library, it can work just fine for non-NVIDIA hardware, as the functions take a parameter specifying the size of the post transform cache (in number of vertices). It does not interact with OpenGL or the graphics hardware in any way.

The library does have some problems. It cannot handle a set of triangles where more than 2 triangles share the same edge.