Shader Magic

[)roi(]

If you thought shaders were limited to special effects, photo filters, animation transitions, etc., you'd be wrong:

Inigo Quilez has demonstrated on Shadertoy that with a bit of finesse you can write a working game in GLSL (the OpenGL Shading Language)...
[Screenshot of the game running on Shadertoy]

https://www.shadertoy.com/view/Ms3XWN

If you've never heard of Shadertoy, then let me be the first to introduce you to a great way to waste time; the mind boggles at what can be done in a shader.

PS: all the shader code is typically written in GLSL (code that runs on the GPU), which means it's portable to any of the major platforms with minimal or no changes. This is a great place to look for a nice effect or transition for your app.
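
To give a feel for how little code a Shadertoy effect needs, here's a minimal sketch of the mainImage entry point every Shadertoy shader implements (the gradient is illustrative, not from the game above):

Code:
// Minimal Shadertoy-style fragment shader: an animated gradient.
// iResolution and iGlobalTime are Shadertoy's built-in uniforms.
void mainImage(out vec4 fragColor, in vec2 fragCoord)
{
    vec2 uv = fragCoord / iResolution.xy;              // normalise to [0,1]
    vec3 col = vec3(uv, 0.5 + 0.5 * sin(iGlobalTime)); // colour shifts over time
    fragColor = vec4(col, 1.0);
}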
 

cguy

Nice. A single thread on a single core of a GPU should be capable of roughly what a 100-200MHz CPU could do back in the day - I wonder if someone will port Doom.
 

[)roi(]

cguy said:
Nice. A single thread on a single core of a GPU should be capable of roughly what a 100-200MHz CPU could do back in the day - I wonder if someone will port Doom.
Yeah, it's nice, and don't you think the code is brilliant in its simplicity?

However, as you can hopefully see, there are quite a few instances where GLSL tends towards verbosity because of language limitations on the GPU, e.g. switch statements are not universally supported and neither are many loop forms, so it can be quite challenging to write code that runs in parallel across the GPU's cores and on many different OSs.
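
For example, where a CPU language would use a switch, older GLSL targets force a flattened if/else chain; a minimal sketch (the mode names are illustrative, not from any real shader):

Code:
// Without switch support (e.g. GLSL ES 1.0), dispatch becomes an
// if/else chain that every invocation walks through.
vec3 applyMode(int mode, vec3 c)
{
    if (mode == 0)      return c;                                        // pass-through
    else if (mode == 1) return vec3(1.0) - c;                            // invert
    else if (mode == 2) return vec3(dot(c, vec3(0.299, 0.587, 0.114)));  // luma
    return c;                                                            // fallback
}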

Meaning a full Doom port without CPU code is really not worth the mileage; you don't gain anything by running code only on the GPU, but you do gain when you can assign the right kinds of work to each processor group.

The top-performing games typically make extensive use of all three main processor groups: the CPU (game engine), the GPU (rendering, shaders, sound, ...) and the DSP (parallel processing of datasets / mathematical calculations, ...).
 

cguy

[)roi(];17517286 said:
Yeah, it's nice, and don't you think the code is brilliant in its simplicity?

However, as you can hopefully see, there are quite a few instances where GLSL tends towards verbosity because of language limitations on the GPU, e.g. switch statements are not universally supported and neither are many loop forms, so it can be quite challenging to write code that runs in parallel across the GPU's cores.

Meaning a full Doom port without CPU code is really not worth the mileage; you don't gain anything by running code only on the GPU, but you do gain when you can assign the right kinds of work to each processor group.

The top-performing games typically make extensive use of all three main processor groups: the CPU (game engine), the GPU (rendering, shaders, sound, ...) and the DSP (parallel processing of datasets / mathematical calculations, ...).

Yeah - anything single-threaded would be extremely inefficient on a GPU. I see this as a fun toy project, like writing an oldskool app/demo in under 4K, etc.
 

[)roi(]

cguy said:
Yeah - anything single-threaded would be extremely inefficient on a GPU. I see this as a fun toy project, like writing an oldskool app/demo in under 4K, etc.
Yeah, in that form it's pretty much the same thing, but it's certainly not limited to that; I use it extensively in creating custom animation renders for transitions. Building them in GLSL means I can use the same code across Android, iPhone, OS X, Windows, Ubuntu, etc., but yeah, the lack of some syntax can be quite challenging and extremely verbose.

For example: I recently built a custom dither shader for an open source Core Image project, and just from the number of if statements you can see how much I miss simple syntax like arrays, switch statements, ...
https://github.com/FlexMonkey/Filterpedia/blob/master/Filterpedia/customFilters/DitherBayer.cikernel
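
To illustrate the shape of the problem, here's a simplified sketch (not the actual DitherBayer kernel; the palette is hypothetical) of what a nearest-colour search looks like in the Core Image kernel language without arrays:

Code:
// One hand-unrolled distance test per palette entry.
kernel vec4 nearestPaletteColour(sampler image)
{
    vec4 pixel = sample(image, samplerCoord(image));

    vec3 best = vec3(0.0, 0.0, 0.0);                    // black
    float bestDist = distance(pixel.rgb, best);

    float d = distance(pixel.rgb, vec3(1.0, 1.0, 1.0)); // white
    if (d < bestDist) { bestDist = d; best = vec3(1.0, 1.0, 1.0); }

    d = distance(pixel.rgb, vec3(1.0, 0.0, 0.0));       // red
    if (d < bestDist) { bestDist = d; best = vec3(1.0, 0.0, 0.0); }

    d = distance(pixel.rgb, vec3(0.0, 0.0, 1.0));       // blue
    if (d < bestDist) { best = vec3(0.0, 0.0, 1.0); }

    return vec4(best, pixel.a);
}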
 

cguy

[)roi(];17517670 said:
Yeah, in that form it's pretty much the same thing, but it's certainly not limited to that; I use it extensively in creating custom animation renders for transitions. Building them in GLSL means I can use the same code across Android, iPhone, OS X, Windows, Ubuntu, etc., but yeah, the lack of some syntax can be quite challenging and extremely verbose.

For example: I recently built a custom dither shader for an open source Core Image project, and just from the number of if statements you can see how much I miss simple syntax like arrays, switch statements, ...
https://github.com/FlexMonkey/Filterpedia/blob/master/Filterpedia/customFilters/DitherBayer.cikernel

Could the "nearest color" per-machine functions not be replaced with a single 3D texture lookup with "nearest"/"point" filtering? And the orderedDither with a single 2D texture or constant-table lookup?
 

[)roi(]

Could the "nearest color" per-machine functions not be replaced with a single 3D texture lookup with "nearest" filtering? And the orderedDither with a single 2D texture or constant-table lookup?
The issue in this case is twofold: firstly, the GLSL support under Core Image is limited; secondly, you have to weigh up the caching penalty, i.e. the price paid for texture fetches from cache during processing.

Whilst the code I wrote is certainly less elegant, it's simple logic and therefore doesn't touch the texture cache.

<Edit> here's the challenge with Core Image re limited GLSL support:
Unsupported Items
Core Image does not support the OpenGL Shading Language source code preprocessor. In addition, the following are not implemented:
  • Data types: mat2, mat3, mat4, struct, arrays
  • Statements: continue, break, discard. Other flow control statements (if, for, while, do while) are supported only when the loop condition can be inferred at the time the code compiles.
  • Expression operators: % << >> | & ^ || && ^^ ~
  • Built-in functions: ftransform, matrixCompMult, dFdx, dFdy, fwidth, noise1, noise2, noise3, noise4, refract
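
Concretely, the flow-control restriction means a loop is only accepted when its trip count is a literal the compiler can unroll; a hedged sketch (boxBlurRow is illustrative, not from Apple's docs):

Code:
// Accepted: the trip count (4) is a literal, so Core Image can unroll it.
kernel vec4 boxBlurRow(sampler image)
{
    vec4 accum = vec4(0.0);
    for (int i = 0; i < 4; i++)
        accum += sample(image, samplerCoord(image) + vec2(float(i), 0.0));
    return accum / 4.0;
}
// Rejected: with radius as a runtime kernel parameter, the trip count
// can't be inferred at compile time, so this won't compile:
//     for (int i = 0; i < int(radius); i++) { ... }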

Whilst the penalty would arguably be low given the limited values within the textures, the messy way ultimately ensures the best performance even on less-than-stellar GPUs, i.e. a good enough frame rate to apply the filter to a realtime video feed.

The newer Apple Metal framework, which doesn't use GLSL, has overcome many of these limitations; however, the code is then proprietary... :sick:
 

cguy

[)roi(];17519444 said:
The issue in this case is twofold: firstly, the GLSL support under Core Image is limited; secondly, you have to weigh up the caching penalty, i.e. the price paid for texture fetches from cache during processing.

Whilst the code I wrote is certainly less elegant, it's simple logic and therefore doesn't touch the texture cache.

<Edit> here's the challenge with Core Image re limited GLSL support:

Whilst the penalty would arguably be low given the limited values within the textures, the messy way ultimately ensures the best performance even on less-than-stellar GPUs, i.e. a good enough frame rate to apply the filter to a realtime video feed.

The newer Apple Metal framework, which doesn't use GLSL, has overcome many of these limitations; however, the code is then proprietary... :sick:

Since nearest filtering is all that is needed, you could use a small 2D texture for both. Performance depends on the specific hardware of course, but I would expect most hardware to deliver one sample per core per clock out of the texture L1 cache, which should be better than the distance-calculation logic.
 

[)roi(]

cguy said:
Since nearest filtering is all that is needed, you could use a small 2D texture for both. Performance depends on the specific hardware of course, but I would expect most hardware to deliver one sample per core per clock out of the texture L1 cache, which should be better than the distance-calculation logic.
Yeah, I guess; we're not talking about a large texture anyway.

As to it being better than the distance calculation: sorry, I still don't agree, as with the texture you pay each time a pixel is processed; granted, we're talking about high-performing cores and fast cache RAM, but there's still a recurring hit, versus the ugly duckling only taking its hit once at compile time.

But in all honesty we'd probably have to look at the resulting assembly code to determine exactly what the difference is, i.e. how effective the compiler was in reducing the ugly duckling's logic instructions vs. the recurring cache cost of the texture.

For such a rudimentary example it's probably just not worth the added brain stress to be sure. Yet I'm still quite confident per-pixel cache resolves are a little more expensive than any additional core instructions for the ugly duckling.

Conclusion:
I guess if you prefer shorter kernel code then sure, go with your solution, but keep in mind it does add more variation on the Core Image side, meaning you would need a mod to accommodate the texture in the Filterpedia project. The other benefit of the ugly duckling is that all the "ugly" logic is visible in the code, i.e. corrections are simple. And just to be pedantic: we're also implicitly assuming there's no variation in colour profiles between the targeted OSs (yet these are very basic palettes, so it probably doesn't matter).

Whew, analysis paralysis... :sick:

Hey... another issue just popped into my mind... sorry I just don't seem to want to give up.
If we look at the values we want to store, the results are single-precision floating-point numbers (i.e. values with digits after the decimal point, for anyone else following).
So simply storing the results in a texture will result in the unintentional loss of those decimals due to A8R8G8B8 packing. That doesn't mean we can't still use textures (we can), but it does mean we now have to retrieve 2 values instead of 1, i.e. so we can do the calculation after reading them from the cache. Again, you have to decide whether this is all worth the effort; sorry, I'm now even less convinced it is.
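
A quick worked example of the precision point, using the error = 49.0 / 65.0 constant from the kernel (just arithmetic, not measured results):

Code:
// What A8R8G8B8 packing does to the kernel's fractional error term.
float error  = 49.0 / 65.0;                         // = 0.753846...
float stored = floor(error * 255.0 + 0.5) / 255.0;  // = 192.0 / 255.0 = 0.752941...
// ~0.0009 is silently lost; keeping 49 and 65 in separate channels and
// dividing after the fetch preserves full precision, at the cost of
// reading two values and doing the divide per pixel.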
 

cguy

[)roi(];17519862 said:
Yeah, I guess; we're not talking about a large texture anyway.

As to it being better than the distance calculation: sorry, I still don't agree, as with the texture you pay each time a pixel is processed; granted, we're talking about high-performing cores and fast cache RAM, but there's still a recurring hit, versus the ugly duckling only taking its hit once at compile time.

But in all honesty we'd probably have to look at the resulting assembly code to determine exactly what the difference is, i.e. how effective the compiler was in reducing the ugly duckling's logic instructions vs. the recurring cache cost of the texture.

For such a rudimentary example it's probably just not worth the added brain stress to be sure. Yet I'm still quite confident per-pixel cache resolves are a little more expensive than any additional core instructions for the ugly duckling.

Conclusion:
I guess if you prefer shorter kernel code then sure, go with your solution, but keep in mind it does add more variation on the Core Image side, meaning you would need a mod to accommodate the texture in the Filterpedia project. The other benefit of the ugly duckling is that all the "ugly" logic is visible in the code, i.e. corrections are simple. And just to be pedantic: we're also implicitly assuming there's no variation in colour profiles between the targeted OSs (yet these are very basic palettes, so it probably doesn't matter).

Whew, analysis paralysis... :sick:

It's not a compile-time hit, it's a run-time hit - those branches, predicates or conditional moves, and the instructions determining the condition, will all have to be executed at run time, creating a much longer-running kernel. You're basically doing in software what efficient hardware was created to do. The cost of an L1 cache hit is negligible and can run in parallel with the code execution; it is certainly far better than the tens of additional instructions being introduced into the critical path. BTW, even though the "fast enough" argument may apply, there can be a significant power saving.
 

[)roi(]

cguy said:
It's not a compile-time hit, it's a run-time hit - those branches, predicates or conditional moves, and the instructions determining the condition, will all have to be executed at run time, creating a much longer-running kernel. You're basically doing in software what efficient hardware was created to do. The cost of an L1 cache hit is negligible and can run in parallel with the code execution; it is certainly far better than the tens of additional instructions being introduced into the critical path. BTW, even though the "fast enough" argument may apply, there can be a significant power saving.
You're missing the jmp instruction, i.e. the cost is only incurred up to the jump + you're assuming the compiler has done nothing to simplify the code. On the cache I still don't agree; you're assuming you have no calculations after the values are retrieved from the cache. As I mentioned, you cannot assume you can store float values for the error & the reduced colour palettes in the texture; only the very latest GPUs support this, and only on OpenGL 4.0 and above. Core Image is not that; it is compatible with GLSL but it is not OpenGL 4.0+.

In a perfect world with OpenGL 4.0+, a fast GPU & cache you would probably be right, but let's be honest: with that available I'd simply be using an array, loops and a switch.
 

cguy

[)roi(];17519862 said:
Yeah, I guess; we're not talking about a large texture anyway.

As to it being better than the distance calculation: sorry, I still don't agree, as with the texture you pay each time a pixel is processed; granted, we're talking about high-performing cores and fast cache RAM, but there's still a recurring hit, versus the ugly duckling only taking its hit once at compile time.

But in all honesty we'd probably have to look at the resulting assembly code to determine exactly what the difference is, i.e. how effective the compiler was in reducing the ugly duckling's logic instructions vs. the recurring cache cost of the texture.

For such a rudimentary example it's probably just not worth the added brain stress to be sure. Yet I'm still quite confident per-pixel cache resolves are a little more expensive than any additional core instructions for the ugly duckling.

Conclusion:
I guess if you prefer shorter kernel code then sure, go with your solution, but keep in mind it does add more variation on the Core Image side, meaning you would need a mod to accommodate the texture in the Filterpedia project. The other benefit of the ugly duckling is that all the "ugly" logic is visible in the code, i.e. corrections are simple. And just to be pedantic: we're also implicitly assuming there's no variation in colour profiles between the targeted OSs (yet these are very basic palettes, so it probably doesn't matter).

Whew, analysis paralysis... :sick:

Hey... another issue just popped into my mind... sorry I just don't seem to want to give up.
If we look at the values we want to store, the results are single-precision floating-point numbers (i.e. values with digits after the decimal point, for anyone else following).
So simply storing the results in a texture will result in the unintentional loss of those decimals due to A8R8G8B8 packing. That doesn't mean we can't still use textures (we can), but it does mean we now have to retrieve 2 values instead of 1, i.e. so we can do the calculation after reading them from the cache. Again, you have to decide whether this is all worth the effort; sorry, I'm now even less convinced it is.

Re: last point. The values being stored, given that they are color values, should all have the form X/255 for some integer X. Texture hardware is capable of converting an integer in the range 0-255 to a floating-point number between 0 and 1.
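
In GLSL terms (a sketch; the sampler name is illustrative), that conversion is free at the fetch:

Code:
// Fetching from an 8-bit-per-channel (A8R8G8B8) texture already yields
// floats: the hardware returns each channel normalised to [0, 1].
uniform sampler2D lut;

vec4 fetchNormalised(vec2 uv)
{
    return texture(lut, uv);   // a stored byte of 192 arrives as 192.0 / 255.0
}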
 

[)roi(]

cguy said:
Re: last point. The values being stored, given that they are color values, should all have the form X/255 for some integer X. Texture hardware is capable of converting an integer in the range 0-255 to a floating-point number between 0 and 1.
I suggest you check; not all current GPUs can support 2D textures with floating-point colour values. Core Image certainly doesn't as far as I know + remember it's part of OpenGL 4, which excludes many iDevices, Macs and quite a lot of Android devices.

As to the calculation: you are retrieving A8R8G8B8 from the texture (I'm ignoring floating-point support); two of these values represent the operands for calculating the error; you then convert int8 to float and do the division. The palette colours would simply be stored as A8R8G8B8; however, you still pay a penalty with the cache lookup, converting from texture2d (A8R8G8B8) to vec4 (float).
 

cguy

Executive Member
Joined
Jan 2, 2013
Messages
8,527
[)roi(];17520136 said:
You're missing the jmp instruction, i.e. the cost is only incurred up to the jump + you're assuming the compiler has done nothing to simplify the code. On the cache I still don't agree; you're assuming you have no calculations after the values are retrieved from the cache. As I mentioned, you cannot assume you can store float values for the error & the reduced colour palettes in the texture; only the very latest GPUs support this, and only on OpenGL 4.0 and above. Core Image is not that; it is compatible with GLSL but it is not OpenGL 4.0+.

In a perfect world with OpenGL 4.0+, a fast GPU & cache you would probably be right, but let's be honest: with that available I'd simply be using an array, loops and a switch.

The cost isn't just what runs up to the jump instruction; this is a SIMT architecture, so the threads don't branch separately, and a uniform branch (a true warp branch) won't be emitted for such short blocks. Also, most of the work (the distance function) isn't predicated by a jump at all. The compiler can't simplify that code, since the color being mapped isn't known at compile time. I am not assuming floats are being stored.
 

cguy

[)roi(];17520202 said:
I suggest you check; not all current GPUs can support 2D textures with floating-point colour values. Core Image certainly doesn't as far as I know + remember it's part of OpenGL 4, which excludes many iDevices, Macs and quite a lot of Android devices.

As to the calculation: you are retrieving A8R8G8B8 from the texture (I'm ignoring floating-point support); two of these values represent the operands for calculating the error; you then convert int8 to float and do the division. The palette colours would simply be stored as A8R8G8B8; however, you still pay a penalty with the cache lookup, converting from texture2d (A8R8G8B8) to vec4 (float).

You don't need float textures. ARGB8-to-float4 direct conversion as part of the texture fetch has been supported since the beginning of time.
 

[)roi(]

cguy said:
It's not a compile-time hit, it's a run-time hit - those branches, predicates or conditional moves, and the instructions determining the condition, will all have to be executed at run time, creating a much longer-running kernel. You're basically doing in software what efficient hardware was created to do. The cost of an L1 cache hit is negligible and can run in parallel with the code execution; it is certainly far better than the tens of additional instructions being introduced into the critical path. BTW, even though the "fast enough" argument may apply, there can be a significant power saving.
Sure, it's both (compile and execution), but you're assuming 3 things:
  • The compiler won't be able to simplify the instructions, i.e. nothing is optimised out.
  • It can never be resolved in 1 instruction, i.e. it will always branch multiple times.
  • The penalty of the branches exceeds the cost of converting A8R8G8B8 to float (from the texture) to calculate the error.
Finally, you assume all GPUs are SIMT; some are SIMD.
 

[)roi(]

cguy said:
You don't need float textures. ARGB8-to-float4 direct conversion as part of the texture fetch has been supported since the beginning of time.
You misunderstand; the compiler will take this code, "error = 49.0 / 65.0;", and store only the float result. Using the texture, you store the original values and have to perform the calculation to arrive at the float error value, decimals included. If you could use float textures you would naturally avoid these extra steps.
 

cguy

[)roi(];17520272 said:
Sure, it's both (compile and execution), but you're assuming 3 things:
  • The compiler won't be able to simplify the instructions, i.e. nothing is optimised out.
  • It can never be resolved in 1 instruction, i.e. it will always branch multiple times.
  • The penalty of the branches exceeds the cost of converting A8R8G8B8 to float (from the texture) to calculate the error.
Finally, you assume all GPUs are SIMT; some are SIMD.

Those aren't assumptions, sunshine, and SIMD has the same issue.
 

[)roi(]

Executive Member
Joined
Apr 15, 2005
Messages
6,282
cguy said:
Those aren't assumptions, sunshine.
Hahaha... sorry, but I don't agree. To progress this tennis match we'll have to revert to a breakdown of the instructions for the compiled kernel, re: you're assuming none of the branch code or calculations are optimised out, i.e. for you it always branches + it appears you're ignoring the extra penalty for A8R8G8B8 to calculate the error, or to perform the palette lookup.

If you've got Xcode we could always run this through instrumentation:
  • The OpenGL analysis should at least be able to confirm the comparative runtime performance, etc.
  • It runs quite well on device too, so I can at least confirm the outcome on e.g. an iPad.

Do you have a Mac? Or do you know of another way to resolve this? The Android tools unfortunately suck in this area, and I haven't ever really looked into Microsoft's Visual Studio capabilities in this regard.
 

cguy

[)roi(];17520328 said:
Hahaha... sorry, but I don't agree. To progress this tennis match we'll have to revert to a breakdown of the instructions for the compiled kernel, re: you're assuming none of the branch code or calculations are optimised out, i.e. for you it always branches + it appears you're ignoring the extra penalty for A8R8G8B8 to calculate the error, or to perform the palette lookup.

You're not even using the right terminology. What do you mean by "optimized out"? The compiler doesn't have advance knowledge of the data that will be run, so it can't optimize out anything we've discussed. Do you perhaps mean "early out", or "skip" paths not taken? That won't happen either, because this is a SIMT/SIMD architecture, which will rather execute both sides of the branch and predicate the results for short code blocks. The error function can easily be scaled up to X/255, which will either have hardware to normalize it to 0-1, or will simply require a multiply by 1/255, which is much faster than running through a flattened condition tree.
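
For anyone following along, the predication described above is roughly equivalent to the following branch-free GLSL (a sketch of the idea, not actual compiler output; the names tie back to the nearest-colour sketch earlier in the thread):

Code:
// Branch-free form of `if (d <= bestDist) { best = candidate; }`:
// both inputs are computed and the result is selected per lane.
vec3 pickNearer(vec3 best, vec3 candidate, float bestDist, float d)
{
    float take = step(d, bestDist);      // 1.0 when d <= bestDist, else 0.0
    return mix(best, candidate, take);   // select candidate where take == 1.0
}
// ...with bestDist updated alongside via min(bestDist, d).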
 